(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets

(11) EP 1 248 466 A1

(12) EUROPEAN PATENT APPLICATION
     published in accordance with Art. 158(3) EPC

(43) Date of publication: 09.10.2002 Bulletin 2002/41
(51) Int Cl.7: H04N 7/24
(21) Application number: 01902702.8
(22) Date of filing: 31.01.2001
(86) International application number: PCT/JP01/00662
(87) International publication number:
     WO 01/078399 (18.10.2001 Gazette 2001/42)
(84) Designated Contracting States:
     AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR
     Designated Extension States:
     AL LT LV MK RO SI
(30) Priority: 11.04.2000 US 546717
(71) Applicant: MITSUBISHI DENKI KABUSHIKI KAISHA
     Tokyo 100-8310 (JP)
(72) Inventors:
     • VETRO, Anthony
       Staten Island, NY 10314 (US)
     • DIVAKARAN, Ajay
       Denville, NJ 07834 (US)
     • SUN, Huifang
       Cranbury, NJ 08572 (US)
(74) Representative: Pfenning, Meinig & Partner
     Mozartstrasse 17
     80336 München (DE)

(54) METHOD AND APPARATUS FOR TRANSCODING OF COMPRESSED IMAGE 



(57) In an apparatus for transcoding a compressed video, a generator
simulates constraints of a network and constraints of a user device.
A classifier is coupled to receive an input compressed video and the
constraints. The classifier generates content information from
features of the input compressed video. A manager produces a
plurality of conversion modes dependent on the constraints and
content information, and a transcoder produces output compressed
videos, one for each of the plurality of conversion modes.

Description
Technical Field 

[0001] This invention relates generally to information 
delivery systems, and more particularly to delivery sys- 
tems that adapt information to available bit rates of a 
network. 

Background Art 



[0002] Recently, a number of standards have been developed for
communicating encoded information. For video sequences, the most
widely used standards include MPEG-1 (for storage and retrieval of
moving pictures), MPEG-2 (for digital television) and H.263, see
ISO/IEC JTC1 CD 11172, MPEG, "Information Technology - Coding of
Moving Pictures and Associated Audio for Digital Storage Media up to
about 1.5 Mbit/s - Part 2: Coding of Moving Pictures Information,"
1991; LeGall, "MPEG: A Video Compression Standard for Multimedia
Applications," Communications of the ACM, Vol. 34, No. 4, pp. 46-58,
1991; ISO/IEC DIS 13818-2, MPEG-2, "Information Technology - Generic
Coding of Moving Pictures and Associated Audio Information - Part 2:
Video," 1994; ITU-T SG XV, DRAFT H.263, "Video Coding for Low Bitrate
Communication," 1996; and ITU-T SG XVI, DRAFT13 H.263+ Q15-A-60
rev.0, "Video Coding for Low Bitrate Communication," 1997.
[0003] These standards are relatively low-level spec- 
ifications that primarily deal with the spatial and tempo- 
ral compression of video sequences. As a common fea- 
ture, these standards perform compression on a per 
frame basis. With these standards, one can achieve 
high compression ratios for a wide range of applications. 
[0004] Newer video coding standards, such as MPEG-4 (for multimedia
applications), see "Information Technology - Generic coding of
audio/visual objects," ISO/IEC FDIS 14496-2 (MPEG4 Visual), Nov.
1998, allow arbitrary-shaped objects to be encoded and decoded as
separate video object planes (VOP). The objects can be visual, audio,
natural, synthetic, primitive, compound, or combinations thereof.
Video objects are composed to form compound objects or "scenes."
[0005] The emerging MPEG-4 standard is intended to enable multimedia
applications, such as interactive video, where natural and synthetic
materials are integrated, and where access is universal. MPEG-4
allows for content-based interactivity. For example, one might want
to "cut-and-paste" a moving figure or object from one video to
another. In this type of application, it is assumed that the objects
in the multimedia content have been identified through some type of
segmentation process, see for example, U.S. Patent Application Sn.
09/326,750 "Method for Ordering Image Spaces to Search for Object
Surfaces" filed on June 4, 1999 by Lin et al.

[0006] In the context of video transmission, these compression
standards are needed to reduce the amount of bandwidth (available bit
rate) that is required by the network. The network can represent a
wireless channel or the Internet. In any case, the network has
limited capacity, and contention for its resources must be resolved
when the content needs to be transmitted.
[0007] Over the years, a great deal of effort has been devoted to
architectures and processes that enable devices to transmit the
content robustly and to adapt the quality of the content to the
available network resources. When the content has already been
encoded, it is sometimes necessary to further convert the already
compressed bitstream before the stream is transmitted through the
network to accommodate, for example, a reduction in the available bit
rate.

[0008] Bitstream conversion or "transcoding" can be classified as bit
rate conversion, resolution conversion, and syntax conversion. Bit
rate conversion includes bit rate scaling and conversion between a
constant bit rate (CBR) and a variable bit rate (VBR). The basic
function of bit rate scaling is to accept an input bitstream and
produce a scaled output bitstream, which meets new load constraints
of a receiver. A bitstream scaler is a transcoder, or filter, that
provides a match between a source bitstream and the receiving load.

[0009] As shown in Figure 1, scaling can typically be accomplished by
a transcoder 100. In a brute force case, the transcoder includes a
decoder 110 and an encoder 120. A compressed input bitstream 101 is
fully decoded at an input rate Rin, then encoded at a new output rate
Rout 102 to produce the output bitstream 103. Usually, the output
rate is lower than the input rate. However, in practice, full
decoding and full encoding in a transcoder is not done due to the
high complexity of encoding the decoded bitstream.

[0010] Earlier work on MPEG-2 transcoding has been published by Sun
et al., in "Architectures for MPEG compressed bitstream scaling,"
IEEE Transactions on Circuits and Systems for Video Technology, April
1996. There, four methods of rate reduction, with varying complexity
and architecture, were presented.
[0011] Figure 2 shows an example method. In this architecture, the
video bitstream is only partially decoded. More specifically,
macroblocks of the input bitstream 201 are variable-length decoded
(VLD) 210. The input bitstream is also delayed 220 and inverse
quantized (IQ) 230 to yield discrete cosine transform (DCT)
coefficients. Given the desired output bit rate, the partially
decoded data are analyzed 240 and a new set of quantizers is applied
at 250 to the DCT blocks. These re-quantized blocks are then
variable-length coded (VLC) 260, and a new output bitstream 203 at a
lower rate can be formed. This scheme is much simpler than the scheme
shown in Fig. 1 because the motion vectors are re-used and an inverse
DCT operation is not needed.
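
The inverse-quantize and re-quantize steps (230 and 250) can be illustrated with a minimal sketch. It assumes a plain uniform quantizer with a single step size per block; real MPEG-2 quantization also uses weighting matrices and different rules for intra and non-intra blocks, which are omitted here. The function name `requantize` and the sample values are hypothetical.

```python
def requantize(levels, q_in, q_out):
    """Map quantized DCT levels from step size q_in to step size q_out.

    Simplified uniform quantizer (an assumption of this sketch):
    coefficient = level * step on the way in, round-half-away-from-zero
    division by the new step on the way out.
    """
    out = []
    for level in levels:
        coeff = level * q_in                      # inverse quantization (IQ)
        sign = -1 if coeff < 0 else 1
        magnitude = int((abs(coeff) + q_out / 2) // q_out)
        out.append(sign * magnitude)              # re-quantized level
    return out


# Doubling the step size roughly halves the levels, so fewer bits are
# needed when the block is variable-length coded again.
block = [12, -7, 3, 1, 0, -1]
coarse = requantize(block, q_in=4, q_out=8)
print(coarse)
```

Because only the quantized levels change, the motion vectors and macroblock modes of the input can be copied to the output unchanged, which is where the simplification over Figure 1 comes from.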
[0012] More recent work by Assuncao et al., in "A frequency domain
video transcoder for dynamic bit-rate reduction of MPEG-2
bitstreams," IEEE Transactions on Circuits and Systems for Video
Technology, pp. 953-957, December 1998, describes a simplified
architecture for the same task. They use a motion compensation (MC)
loop, operating in the frequency domain, for drift compensation.
Approximate matrices are derived for fast computation of the MC
blocks in the frequency domain. A Lagrangian optimization is used to
calculate the best quantizer scales for transcoding.
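
As a generic illustration of Lagrangian quantizer selection (not the specific procedure of the cited paper), the sketch below assumes that each block comes with a small table of candidate quantizer scales and their measured rate and distortion. For a multiplier lam, each block independently picks the candidate minimizing D + lam * R, and lam is found by bisection so that the total rate fits a budget. All names and numbers are hypothetical.

```python
def pick_quantizers(options, lam):
    """For each block, pick the (q, rate, distortion) minimizing D + lam * R."""
    return [min(opts, key=lambda o: o[2] + lam * o[1]) for opts in options]


def meet_budget(options, budget, lam_hi=1e6, iters=50):
    """Bisect on the multiplier until the total rate fits the budget.

    Assumes the coarsest choices (very large lam) already fit the budget.
    """
    lam_lo = 0.0
    for _ in range(iters):
        lam = (lam_lo + lam_hi) / 2
        rate = sum(o[1] for o in pick_quantizers(options, lam))
        if rate > budget:
            lam_lo = lam      # too many bits: penalize rate more
        else:
            lam_hi = lam      # fits: try a smaller penalty
    return pick_quantizers(options, lam_hi)


# Two blocks, three candidate quantizer scales each: (q, bits, distortion).
options = [
    [(2, 100, 1.0), (4, 60, 4.0), (8, 30, 16.0)],
    [(2, 80, 2.0), (4, 50, 5.0), (8, 20, 30.0)],
]
chosen = meet_budget(options, budget=120)
print([c[0] for c in chosen], sum(c[1] for c in chosen))
```

The per-block minimization is what makes the approach attractive: a single scalar lam couples all blocks, so no joint search over combinations of quantizers is needed.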
[0013] Other work by Sorial et al., "Joint transcoding of multiple
MPEG video bitstreams," Proceedings of the International Symposium on
Circuits and Systems, May 1999, presents a method of jointly
transcoding multiple MPEG-2 bitstreams, see also U.S. Patent
Application Sn. 09/410,552 "Estimating Rate-Distortion
Characteristics of Binary Shape Data," filed October 1, 1999 by Vetro
et al.

[0014] According to prior art compression standards, the number of
bits allocated for encoding texture information is controlled by a
quantization parameter (QP). The above papers are similar in that
changing the QP based on information that is contained in the
original bitstream reduces the rate of texture bits. For an efficient
implementation, the information is usually extracted directly in the
compressed domain and can include measures that relate to the motion
of macroblocks or residual energy of DCT blocks. This type of
analysis can be found in the bit allocation analyzer.
[0015] Although in some cases, the bitstream can be 
preprocessed, it is still important that the transcoder op- 
erates in real-time. Therefore, significant processing de- 
lays on the bitstream cannot be tolerated. For example, 
it is not feasible for the transcoder to extract information 
from a group of frames and then to transcode the con- 
tent based on this look-ahead information. This cannot 
work for live broadcasts, or video conferencing. Al- 
though it is possible to achieve better transcoding re- 
sults in terms of quality due to better bit allocation, such 
an implementation for real-time applications is imprac- 
tical. 

[0016] It is also important to note that classical methods of
transcoding are limited in their ability to reduce the bit rate. In
other words, if only the QP of the outgoing video is changed, then
there is a limit to how much one can reduce the rate. The limitation
in reduction is dependent on the bitstream under consideration.
Changing the QP to a maximum value will usually degrade the content
of the bitstream significantly. An alternative to reducing the
spatial quality is to reduce the temporal quality, i.e., drop or skip
frames. Again, skipping too many frames will also degrade the quality
significantly. If both reductions are considered, then the transcoder
is faced with a trade-off in spatial versus temporal quality.

[0017] Such a spatio-temporal trade-off can also be considered in the
encoder. However, not all video-coding standards support frame
skipping. For example, in MPEG-1 and MPEG-2, the Group of Pictures
(GOP) structure is pre-determined, i.e., the Intra frame period and
the distance between anchor frames are fixed. As a result, all
pictures must be encoded. To get around this temporal constraint, the
syntax does allow macroblocks to be skipped. If all macroblocks in a
frame are skipped, then the frame has essentially been skipped. At
least one bit is used for each macroblock in the frame to indicate
this skipping. This can be inefficient for some bit rates.

[0018] The H.263 and MPEG-4 standards do allow frame skipping. Both
standards support a syntax that allows a temporal reference to be
specified. However, frame skipping there has mainly been used to
satisfy buffer constraints. In other words, if the buffer occupancy
is too high and in danger of overflow, then the encoder will skip a
frame to reduce the flow of bits into the buffer and give the buffer
some time to send its current bits.
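
The buffer-driven skipping described above can be sketched as follows. This is a simplified model under stated assumptions: the buffer drains a fixed number of bits per frame interval, and a frame is skipped whenever coding it would push the occupancy above a chosen level. The function name `encode_with_skips`, its parameters, and the numbers are hypothetical and not taken from either standard.

```python
def encode_with_skips(frame_bits, drain_per_frame, skip_level):
    """Decide per frame whether to code or skip, based on buffer occupancy.

    frame_bits: bits each frame would produce if coded.
    drain_per_frame: bits the channel removes from the buffer per frame interval.
    skip_level: occupancy (in bits) above which a frame is skipped.
    """
    occupancy = 0
    decisions = []
    for bits in frame_bits:
        if occupancy + bits > skip_level:
            decisions.append("skip")      # give the buffer time to drain
        else:
            occupancy += bits
            decisions.append("code")
        occupancy = max(0, occupancy - drain_per_frame)
    return decisions


decisions = encode_with_skips([400, 400, 400, 400], drain_per_frame=200, skip_level=700)
print(decisions)
```

In this model the skip decision depends only on the buffer state, never on what the skipped frame contains, which is the limitation the later sections address.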
[0019] A more sophisticated use of this syntax allows one to make the
spatio-temporal trade-offs in non-emergency situations, i.e., code
more frames at a lower spatial quality, or code fewer frames at a
higher spatial quality. Depending on the complexity of the content,
either strategy can potentially lead to better overall quality.
Methods to control this trade-off in an MPEG-4 object-based encoder
have been described in U.S. Patent No. 5,969,764, "Adaptive video
coding method", issued on October 19, 1999 to Sun et al., and in
"MPEG-4 rate control for multiple video objects," IEEE Trans. on
Circuits and Systems for Video Technology, February 1999, by Vetro et
al. There, two modes of operation were introduced, HighMode and
LowMode. Depending on the current mode of operation, which was
determined by the outgoing temporal resolution, the way bits were
allocated was adjusted.
[0020] Besides the work referenced above, methods 
to control this spatio-temporal trade-off have received 
minimal attention. Furthermore, the information that is 
available in the transcoder to make such decisions is 
quite different than that of the encoder. In the following, 
methods for making such trade-offs in the transcoder 
are described. 

[0021] As a result, the transcoder must find some al- 
ternate means of transmitting the information that is con- 
tained in a bitstream to adapt to reductions in available 
bit rates. 

[0022] The most recent standardization effort taken on by the MPEG
standard committee is that of MPEG-7, formally called "Multimedia
Content Description Interface," see "MPEG-7 Context, Objectives and
Technical Roadmap," ISO/IEC N2861, July 1999. Essentially, this
standard plans to incorporate a set of descriptors and description
schemes that can be used to describe various types of multimedia
content. The descriptors and description schemes are associated with
the content itself and allow for fast and efficient searching of
material that is of interest to a particular user. It is important to
note that this standard is not meant to replace previous coding
standards; rather, it builds on other standard representations,
especially MPEG-4, because the multimedia content can be decomposed
into different objects and each object can be assigned a unique set
of descriptors. Also, the standard is independent of the format in
which the content is stored.
[0023] The primary application of MPEG-7 is expected to be search and
retrieval applications, see "MPEG-7 Applications," ISO/IEC N2861,
July 1999. In a simple application environment, a user can specify
some attributes of a particular object. At this low level of
representation, these attributes can include descriptors that
describe the texture, motion and shape of the particular object. A
method of representing and comparing shapes has been described in
U.S. Patent Application Sn. 09/326,759 "Method for Ordering Image
Spaces to Represent Object Shapes" filed on June 4, 1999 by Lin et
al., and a method for describing the motion activity has been
described in U.S. Patent Application Sn. 09/406,444 "Activity
Descriptor for Video Sequences" filed on September 27, 1999 by
Divakaran et al. To obtain a higher level of representation, one can
consider more elaborate description schemes that combine several
low-level descriptors. In fact, these description schemes can even
contain other description schemes, see "MPEG-7 Multimedia Description
Schemes WD (V1.0)," ISO/IEC N3113, December 1999 and U.S. Patent
Application Sn. 09/385,169 "Method for representing and comparing
multimedia content," filed August 30, 1999 by Lin et al.

[0024] These descriptors and description schemes 
that will be provided by the MPEG-7 standard allow one 
access to properties of the video content that cannot be 
derived by a transcoder. For example, these properties 
can represent look-ahead information that was as- 
sumed to be inaccessible to the transcoder. The only 
reason that the transcoder has access to these proper- 
ties is because the properties have been derived from 
the content earlier, i.e., the content has been pre-proc- 
essed and stored in a database with its associated me- 
ta-data. 

[0025] The information itself can be either syntactic or 
semantic, where syntactic information refers to the 
physical and logical signal aspects of the content, while 
the semantic information refers to the conceptual mean- 
ing of the content. For a video sequence, the syntactic 
elements can be related to the color, shape and motion 
of a particular object. On the other hand, the semantic 
elements can refer to information that cannot be extract- 
ed from low-level descriptors, such as the time and 
place of an event or the name of a person in a video 
sequence. 

[0026] Given the background on traditional methods 
of transcoding and the current status of the MPEG-7 
standard, there exists a need to define an improved 
transcoding system that utilizes information from both 
sides. 



Disclosure of Invention 

[0027] In an apparatus for transcoding a compressed video, a
generator simulates constraints of a network and constraints of a
user device. A classifier is coupled to receive an input compressed
video and the constraints. The classifier generates content
information from features of the input compressed video. A manager
produces a plurality of conversion modes dependent on the constraints
and content information, and a transcoder produces output compressed
videos, one for each of the plurality of conversion modes.



Brief Description of Drawings 

[0028]

Figure 1 is a block diagram of a prior art transcoder;
Figure 2 is a block diagram of a prior art partial decoder/encoder;
Figure 3 is a block diagram of an adaptable bitstream delivery system
according to the invention;
Figure 4 is a block diagram of an adaptable transcoder and transcoder
manager;
Figure 5 is a graph of transcoding functions that can be used by the
transcoder and manager of Figure 4;
Figure 6 is a block diagram of object-based bitstream scaling;
Figure 7 is a graph of a search space;
Figure 8 is a block diagram of details of an object-based transcoder
according to the invention;
Figure 9 is a block diagram of feature extraction according to cue
levels;
Figure 10 is a block diagram of a video content classifier with three
stages;
Figure 11 is a block diagram of descriptor schemes;
Figure 12 is a block diagram of transcoding according to the
descriptor schemes of Figure 11a;
Figure 13 is a block diagram of transcoding according to the
descriptor schemes of Figure 11b;
Figure 14 is a block diagram of a system for generating content
summaries and variations of content according to the content
summaries; and
Figure 15 is a graph of transcoding functions based on the content
summaries and content variations of Figure 14.

Best Mode for Carrying Out the Invention 

[0029] We describe a video delivery system that is capable of
converting, or "scaling," a compressed input bitstream to a
compressed output bitstream at a target rate, i.e., an available bit
rate (ABR) of a network. We also describe a delivery system that
delivers variations of the compressed input bitstream. Furthermore,
we describe transcoding based on low-level features and descriptor
schemes of bitstreams.
[0030] Usually the target rate of the output bitstream is less than
the rate of the input bitstream. In other words, the task of our
transcoder is to further compress the bitstream, usually due to
constraints in network resources or receiver load in an end-user
device. We describe content-based transcoding techniques for various
levels of a video, the levels including a program level, a shot
level, a frame level and video object level, and a sub-region level.
It is our goal to perform transcoding while maximizing rate-quality
(RQ) characteristics.
[0031] Our system is capable of overcoming the drawbacks of
conventional transcoders, namely limitations in rate conversion,
particularly in real-time applications. Although conventional
transcoding techniques can sufficiently reduce the rate, the quality
of the content is usually severely degraded. Often, the information
that is conveyed in the reduced bit rate bitstream is lost
altogether. Conventionally, bitstream "quality" is measured as
bit-by-bit differences between the input and output bitstreams.

[0032] We describe transcoding techniques that are 
able to achieve the target rate while maintaining the 
quality of the content of the bitstream. 

[Continuous Conversion] 

[0033] Conventional frame-based transcoding tech- 
niques can be defined as continuous-conversions. Be- 
cause conventional techniques attempt to continuously 
maintain the best trade-off in spatial vs. temporal quality, 
the output is always a sequence of frames that best rep- 
resents the input sequence. When a particular frame is 
skipped to meet constraints on the rate, the information 
that is contained within the skipped frame is not consid- 
ered. If enough frames are skipped, then the bitstream 
that is received is meaningless to a user, or at the very 
best, less than satisfactory. 

[Quality Distortion Metrics] 

[0034] A conventional continuous-conversion transcoder makes optimal
decisions in the rate-distortion sense with regard to trade-offs in
spatial and temporal quality. In such a transcoder, the distortion is
usually taken to be any classic distortion metric, such as the peak
signal to noise ratio (PSNR). It should be emphasized that in such a
conversion, the distortion is not a measure of how well the content
of the bitstream is being conveyed, but rather of the bit-to-bit
differences between the original input bitstream and the
reconstructed output bitstream, i.e., the quality.
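
For reference, PSNR is conventionally computed from the mean squared error between decoded sample values of the original and the reconstruction. The sketch below assumes 8-bit samples (peak value 255) held in flat Python lists; the function name `psnr` is chosen for illustration.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf          # identical signals
    return 10 * math.log10(peak ** 2 / mse)


print(psnr([52, 55, 61, 59], [50, 56, 60, 62]))
```

The metric depends only on sample-wise differences, so two outputs with equal PSNR can convey very different amounts of meaningful content.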

[Fidelity of Bitstream] 

[0035] In one embodiment for transcoding a bitstream sequence under
low bit rate constraints, we summarize the content of the bitstream
with a small number of frames. In this way, we do not use the classic
distortion metrics focused on quality. Rather, we adopt a new measure
we call "fidelity." Fidelity takes into consideration the semantics
and syntax of the content. By the semantics and syntax, we do not
mean the bits or pixels, but rather humanly meaningful concepts
represented by the bits, for example, words, sounds, level of humor
and action of videos, video objects, and the like.
[0036] Fidelity can be defined in a number of ways. However,
fidelity, as we define it, is not related to conventional
quantitative quality, e.g., the bit-by-bit differences. Rather, our
fidelity measures the degree to which a frame or any number of frames
conveys the information contained in the original image sequence,
i.e., the content or higher level meaning of the information that is
conveyed, and not the raw bits.

[Discrete-Summary Transcoder] 

[0037] Fidelity is a more subjective or semantic measure than
conventional distortion metrics. However, in our system, fidelity is
a useful measure to gauge the non-conventional transcoder's
performance. Because the output of our transcoder according to one
embodiment is a finite set of relatively high quality frames that
attempt to summarize the entire sequence of bits, we refer to this
type of transcoder as a "discrete-summary transcoder."

[0038] For example, at low bit rates, we choose a small number of
high quality frames to represent the video. In this way the semantic
"meaning" of the bitstream is preserved. It can be stated that this
discrete-summary transcoder performs a high-level semantic sampling
of the input bitstream, whereas continuous transcoders only sample
pixels quantitatively in the spatial and temporal domains. In
situations where the bit rate is severely limited, we sample "rich"
frames to preserve the fidelity of the content encoded in the
bitstream.
[0039] Because we selectively sample rich frames, we can lose one
aspect in the bitstream: motion. Preferably, we resort to
discrete-summary transcoding only when the rate-distortion
performance of the continuous-conversion transcoder is severely
degraded or cannot meet the target rate. Under these conditions,
conventional continuous-conversion transcoders lose fluid motion
because the frame rate is so low that the rate of information
delivery becomes jerky and disturbing to the user.

[0040] The major gain of discrete-summary transcoding over
conventional continuous-conversion transcoding is that
discrete-summary transcoders attempt to choose frames that are rich
in information, whereas continuous-conversion transcoders under
severe rate constraints will drop frames that are rich in
information.
[0041] In order to control which transcoder is best for the given
situation, we describe a content-network-device (CND) manager. The
purpose of the CND manager is to select which transcoder to use. The
selection is based on data obtained from content, network, and user
device characteristics. We can also simulate these device
characteristics in an "off-line" mode to generate variations of the
bitstream for later delivery.

[Adaptable Bitstream Delivery System] 



[0042] As shown in Figure 3, an adaptable bitstream delivery system
300 includes four major components: a content classifier 310, a model
predictor 320, a content-network-device manager 330 and a switchable
transcoder 340.

[0043] The goal of the system 300 is to deliver a compressed
bitstream 301 with information content through a network 350 to a
user device 360. The content of the bitstream can be visual, audio,
textual, natural, synthetic, primitive, data, compound or
combinations thereof. The network can be wireless, packet-switched,
or another network with unpredictable operational characteristics.
The user device can be a video receiver, a stationary or mobile
wireless receiver, or another like user device with internal resource
constraints that may make quality reception of the bitstream
difficult.
[0044] As an advantage, the system maintains the se- 
mantic fidelity of the content even when the bitstream 
needs to be further compressed to meet network and 
user device characteristics. 

[0045] The input compressed bitstream is directed to the transcoder
and the content classifier. The transcoder can ultimately reduce the
rate of an output compressed bitstream 309 directed through the
network to the user device.

[0046] The content classifier 310 extracts content information (CI)
302 from the input bitstream for the manager. The main function of
the content classifier is to map semantic features of content
characteristics, such as motion activity, video change information
and texture, into a set of parameters that are used to make
rate-quality trade-offs in the content-network-device manager. To
assist with this mapping function, the content classifier can also
accept meta-data information 303. The meta-data can be low-level and
high-level. Examples of meta-data include descriptors and description
schemes that are specified by the emerging MPEG-7 standard.
[0047] In this architecture, the model predictor 320 provides
real-time feedback 321 regarding the dynamics of the network 350, and
possible constraining characteristics of the user device 360. For
example, the predictor reports network congestion and available bit
rate (ABR). The predictor also receives and translates feedback on
packet loss ratios within the network. The predictor estimates a
current network state, and long-term network predictions 321.
Characteristically, the user device can have limited resources, for
example, processing power, memory, and display constraints. If the
user device is a cellular telephone, for instance, then the display
can be constrained to textual information or low-resolution images,
or even worse, only audio. These characteristics can also impact the
selection of a transcoding modality.



[0048] In addition to receiving the meta-data 303, the manager 330
also receives input from both the content classifier 310 and the
model predictor 320. The CND manager combines output data from these
two sources of information so that an optimal transcoding strategy is
determined for the switchable transcoder 340.



[Content Classifier] 

[0049] In the field of pattern analysis and recognition,
classification can be achieved by extracting features from various
levels of the video, for example, program features, shot features,
frame features, and features of sub-regions within frames. The
features themselves can be extracted using sophisticated transforms
or simple local operators. Regardless of how the features are
extracted, given a feature space of dimension N, each pattern can be
represented as a point in this feature space.
[0050] It is customary to subject a variety of different training
patterns as input to this extraction process and to plot the outcomes
in feature space. Provided that the feature set and training patterns
are appropriate, we observe several clusters of points called
"classes." These classes allow us to distinguish different patterns
and group similar patterns, and to determine boundaries between the
observed classes. Usually, the boundaries between classes adhere to
some cost for misclassification and attempt to minimize the overall
error.
[0051] After appropriate classes have been identified and suitable
boundaries between the classes have been drawn, we can quickly
classify new patterns in the bitstream. Depending on the problem,
this can be accomplished with a neural network or other known
classification techniques such as Support Vector Machines, see
Cristianini et al. in "An Introduction to Support Vector Machines
(and other kernel-based learning methods)," Cambridge University
Press, 2000.
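
To make the idea of classes in feature space concrete, the following sketch swaps in a nearest-centroid rule in place of the neural network or Support Vector Machine mentioned above, purely because it fits in a few lines. The two-dimensional features (for instance, a motion-activity value and a texture value) and the class labels are hypothetical.

```python
def train_centroids(samples):
    """samples: list of (feature_vector, label). Returns label -> mean vector."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(vec, centroids):
    """Assign vec to the class whose centroid is closest (squared Euclidean)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: dist2(centroids[label]))


# Hypothetical training patterns: (motion activity, texture) per shot.
training = [
    ([0.9, 0.2], "high_action"), ([0.8, 0.3], "high_action"),
    ([0.1, 0.7], "talking_head"), ([0.2, 0.8], "talking_head"),
]
centroids = train_centroids(training)
print(classify([0.7, 0.1], centroids), classify([0.0, 0.9], centroids))
```

With this rule the boundary between two classes is the set of points equidistant from both centroids; a trained SVM or neural network would draw the boundary differently but serves the same role.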
[0052] The content classifier 310 operates in three stages (I, II,
and III 311-313). First, we classify the bitstream content so that
higher-level semantics can be inferred, and second, we adapt the
classified content to network and user device characteristics.
[0053] In the first stage (I) 311, we extract a number of low-level
features from the compressed bitstream using conventional techniques,
for example, motion activity, texture, or DCT coefficients. We can
also access the meta-data 303, such as MPEG-7 descriptors and
description schemes. If the meta-data are available, then less work
needs to be performed on the compressed bitstream. As a final outcome
of this first stage, a pre-determined set of content features is
mapped to a finite set of semantic classes or high-level meta-data.
Furthermore, within each semantic class, we differentiate based on
the coding complexity, i.e., the complexity is conditional on the
semantic class and network characteristics, and possibly device
characteristics.
[0054] This high-level understanding of the content is passed on to
the CND manager 330 as content information (CI) 302. The CI 302, in
part, characterizes the potential performance of this embodiment of
the switchable transcoder.

[0055] The above classification is useful in terms of 
content understanding, and, ultimately discrete-sum- 
mary transcoding, but it is also useful as an intermediate 
stage result. Essentially, we have a new set of classes 
that serve as input to the second stage II 312 of classi- 
fication. In the second stage of classification, we map 
our semantic classes to features of network and device 
characteristics. These features will help us to determine 
the characteristics of rate-quality functions that assist 
the system in developing a transcoding strategy. In other 
words, if it is probable that a certain semantic class is 
characterized by bursty data due to object movement or 
video changes, then this should be accounted for when 
estimating how much resource the network should
provide. The third stage 313 is described below with respect
to other embodiments.

[Content-Network-Device Manager]

[0056] The content-network-device (CND) manager 
330 and transcoder 340 are shown in greater detail in 
Figure 4. The CND manager includes a discrete-contin- 
uous control 431 and a content-network-device (CND) 
integrator 432. The transcoder 340 includes a plurality
of transcoders 441-443.

[0057] The control 431, using a switch 450, is
responsible for deciding how the input compressed bitstream
301 should be transcoded, e.g., with the discrete-summary
transcoder 441, the continuous-conversion transcoder
442, or some other transcoder 443. The network-content
manager also dynamically adapts to a target
rate for the transcoder and considers resource-constraining
characteristics of the network and user device.
These two important decisions are made by the
control 431.

[0058] To better understand how the control makes
optimal selection decisions, Figure 5 graphs a plurality
of rate-quality functions with respect to rate 501 and
quality 502 scales. One rate-quality function, that of the
continuous-conversion transcoder 442, is shown by a
convex function 503. The rate-quality curve for the
discrete-summary transcoder 441 is represented by a linear
function 504. Other transcoders can have different
functions.

[0059] It should be noted that these curves are only
drawn for illustrative purposes. The true forms of the
functions for a particular transcoder can vary depending
on the content, how the content has been classified, and
possibly the current state of the network and device
constraining characteristics. At low bit rates, the
continuous-conversion transcoder degrades rapidly in
quality, for the reasons stated above. The optimal quality
function 505 is shown in bold. This function best models
the optimal quality that can be achieved for a given bit
rate and user device.



[0060] We note there is a crossover in transcoding
technique at a rate = T 506. For rates greater than T, it
is best to use the continuous-conversion transcoder,
and for rates less than T, it is best to use the
discrete-summary transcoder. Of course, the crossover point will
vary dynamically as content and network characteristics
vary.
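
The crossover behavior can be illustrated with a minimal Python sketch. The two curve shapes below are purely hypothetical stand-ins for the convex function 503 and the linear function 504; the names `continuous_quality`, `discrete_quality`, `select_transcoder`, and `find_crossover` are assumptions made for illustration and are not part of the described apparatus.

```python
def continuous_quality(rate: float) -> float:
    """Hypothetical convex rate-quality curve (stand-in for function 503)."""
    return (rate / 100.0) ** 2


def discrete_quality(rate: float) -> float:
    """Hypothetical linear rate-quality curve (stand-in for function 504)."""
    return 0.5 * rate / 100.0


def optimal_quality(rate: float) -> float:
    """Upper envelope of the two curves (stand-in for function 505)."""
    return max(continuous_quality(rate), discrete_quality(rate))


def select_transcoder(rate: float) -> str:
    """Pick whichever transcoder promises the higher quality at this rate."""
    if continuous_quality(rate) > discrete_quality(rate):
        return "continuous-conversion"
    return "discrete-summary"


def find_crossover(lo: float, hi: float, tol: float = 1e-6) -> float:
    """Bisect for the rate T where the preferred transcoder changes.

    Assumes the discrete-summary transcoder is preferred at `lo` and the
    continuous-conversion transcoder is preferred at `hi`.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if continuous_quality(mid) > discrete_quality(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

With these assumed curves the crossover lies at a rate of 50 (in the same arbitrary units as the curves). In practice the curves, and therefore T, would be re-estimated as the content and network characteristics change.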

[0061] As mentioned above, continuous-conversion
transcoders usually assume classic distortion metrics,
such as PSNR. Because such measures do not apply
to our discrete-summary transcoder, it makes more
sense to map the classic distortion metrics to a measure
of "fidelity." Fidelity measures how well the content is
semantically summarized, and not the quantitative
bit-by-bit difference. Given the same quality metric, we
avoid any inconsistency in deciding the optimal
transcoding strategy.

[Content-Network-Device Integrator] 

[0062] Referring back to Figure 4, the CND integrator
432 is the part of the CND manager that combines
content information 302 from the content classifier
310 and network-device predictions 321 from the
model predictor. It is this part of the manager that
generates the model expressed as the rate-quality functions
shown in Fig. 5, or other like optimization functions. To
form the optimal operating model 321, the CND
integrator examines the mappings CI from the content
classifier and bit rate feedback 351 that is output from the
switchable transcoder 340. Using this information, the
integrator chooses the optimal modeling function 505
that has certain model parameters. The rate feedback
351 is used to dynamically refine the parameters. If the
integrator finds that the chosen model is not optimal,
then the integrator can decide to dynamically switch
rate-quality functions. Also, the integrator can track
several functions for different objects or different bitstreams
and consider the functions either separately or jointly.

[Impact of Network Predictions] 

[0063] The network predictions 321 can affect these
characteristic functions by modulating certain portions
of the optimal curve 505 one way or another. For
instance, even when higher bit rates are available, one still
needs to be careful. The network model can allow
us to expend a high number of bits at a particular time
instant, but long-term effects tell us that congestion is
likely to build quickly; therefore, our system can choose
to hold back and continue to operate at a lower rate.
Thus, we avoid problems related to a sudden drop in the
available bit rate. These types of characteristics can be
accounted for by modulating the curves of our
transcoder.

[Impact of Device Constraints]

[0064] Device characteristics also need to be
considered. Mobile devices have different operating
characteristics than stationary devices; for example, Doppler spread
can degrade performance at higher available bit rates.
Thus, a lower bit rate should be selected. The device
can have limited processing, storage, and display
capabilities that can impact the transcoder. For example,
there is no point in delivering a video to an audio-only
device. In fact, the switchable transcoder can include
another transcoder 443 that converts speech to text, or
data to speech, etc. The important point is that the
present switchable transcoder takes the semantics of
the bitstream content and the destination device into
consideration, whereas most prior art transcoders
consider only the available bit rate.

[Frame-Based Transcoder] 

[0065] The details of a number of frame-based
transcoders are known in the prior art. For example,
see any of the following U.S. Patents: 5,991,716 -
Transcoder with prevention of tandem coding of speech;
5,940,130 - Video transcoder with by-pass transfer of
extracted motion compensation data; 5,768,278 - N:1
Transcoder; 5,764,298 - Digital data transcoder with
relaxed internal decoder/coder interface frame jitter
requirements; 5,526,397 - Switching transcoder;
5,334,977 - ADPCM transcoder wherein different bit
numbers are used in code conversion, or other like
patents. None of these describe our technique for selecting
a particular transcoding strategy depending on the
semantic content of the bitstream and network
characteristics. Below, we will also describe an object-based
bitstream transcoder that can be selected.
[0066] The emphasis of this embodiment is to enable 
the dynamic selection of a transcoding strategy that 
gives the best delivery of the semantic content of the 
bitstream, and not how the actual transcoding is per- 
formed. 

[0067] So far we have described the different types of 
trade-offs that can be made by a switchable transcoder 
including a continuous-conversion transcoder and a dis- 
crete-summary transcoder. In each of these transcod- 
ers, an optimal rate-quality curve is assumed. 

[Object-Based Transcoding] 

[0068] We now describe in detail how the rate-quality
curve for continuous-conversion transcoders is derived
and how suitable encoding parameters such as the QP
and the amount of frame skip are determined. We also
extend this work to the context of MPEG-4. We describe
a framework that adaptively transcodes or scales
objects in the video, or scene, based on available bit rate
and complexity of each video object.
[0069] Our scheme is adaptive in that various
techniques can be employed to reduce the rate depending
on the ratio of incoming to outgoing rate. Because our
goal is to provide the best overall quality for objects of
varying complexity, the degradation of each object need
not be the same. Note, here we parse objects, and not
frames as described above.

[0070] The novelty of our system is that it is capable
of transcoding multiple objects of varying complexity
and size, but more important, our system is capable of
making spatio-temporal trade-offs to optimize the
overall quality of the video. We focus on object-based
bitstreams due to the added flexibility. We also describe
various means that are available to manipulate the
quality of a particular object.
[0071] The main point worth noting is that the objects
themselves need not be transcoded with equal quality.
For example, the texture data of one object can be
reduced, keeping intact its shape information, while the
shape information of another object is reduced, keeping
its texture information intact. Many other combinations
can also be considered, including dropping frames. In a
news clip, for example, it is possible to reduce the frame
rate along with the texture and shape bits for the
background, while keeping the information associated with
the foreground news reader intact.

[Quality of a Bitstream for Object-Based 
Transcoding] 

[0072] As stated above, conventional frame-based
transcoders can reduce the bit rate sufficiently.
However, the quality of the content can be severely degraded
and the information that is conveyed in the reduced bit
rate bitstream can be lost altogether. Conventionally,
bitstream "quality" is measured as the bit-by-bit
differences between the input and output bitstreams.
[0073] However, in object-based transcoding
according to the invention, we are no longer constrained to
manipulate the entire video. We transcode a bitstream that
has been decomposed into meaningful video objects.
We realize that the delivery of each object, along with
the quality of each object, has a different overall impact
on quality. Because our object-based scheme has this
finer level of access, it becomes possible to reduce the
level of spatio-temporal quality of one object without
significantly impacting the quality of the entire stream. This
is an entirely different strategy than that used by
conventional frame-based transcoders.

[0074] In contrast to conventional bitstream quality,
which measures the bit-by-bit differences of the entire
video without regard to content, we introduce the notion
of "perceptual video quality." Perceptual video quality is
related to the quality of objects in the video that convey
the intended information. For instance, the background
of a video can be completely lost without affecting the
perceptual video quality of a more important foreground
object.

[Object-Based Transcoding Framework]

[0075] Figure 6 shows a high-level block diagram of
an object-based transcoder 600 according to an
alternative embodiment of the invention. The transcoder 600
includes a demultiplexer 601, a multiplexer 602, and an
output buffer 603. The transcoder 600 also includes one
or more object-based transcoders 800 operated by a
transcoding control unit (TCU) 610 according to control
information 604. The unit 610 includes shape, texture,
temporal, and spatial analyzers 611-614.
[0076] An input compressed bitstream 605 to the
transcoder 600 includes one or more object-based
elementary bitstreams. The object-based bitstreams
can be serial or parallel. The total bit rate of the bitstream
605 is $R_{in}$. The output compressed bitstream 606 from
the transcoder 600 has a total bit rate $R_{out}$ such that
$R_{out} < R_{in}$.

[0077] The demultiplexer 601 provides one or more
elementary bitstreams to each of the object-based
transcoders 800, and the object-based transcoders 800
provide object data 607 to the TCU 610. The transcoders
800 scale the elementary bitstreams. The scaled
bitstreams are composed by the multiplexer 602 before
being passed on to the output buffer 603, and from there
to a receiver. The buffer 603 also provides rate-feedback
information 608 to the TCU.
[0078] As stated above, the control information 604
that is passed to each of the transcoders 800 is provided
by the TCU. As indicated in Figure 6, the TCU is
responsible for the analysis of texture and shape data, as well
as temporal and spatial resolution. All of these new
degrees of freedom make the object-based transcoding
framework unique and desirable for network
applications. As with the MPEG-2 and H.263 coding
standards, MPEG-4 exploits the spatio-temporal redundancy
of video using motion compensation and DCT. As a
result, the core of our object-based transcoders 800 is an
adaptation of MPEG-2 transcoders that have been
described above. The major difference is that shape
information is now contained within the bitstream, and with
regard to texture coding, tools are provided to predict
DC and AC for Intra blocks.

[0079] It is also important to note that the transcoding 
of texture is indeed dependent on the shape data. In oth- 
er words, the shape data cannot simply be parsed out 
and ignored; the syntax of a compliant bitstream de- 
pends on the decoded shape data. 
[0080] Our object-based input and output
bitstreams 605-606 are entirely different than traditional
frame-based video programs. Also, MPEG-2 does not
permit dynamic frame skipping. There, the GOP
structure and reference frames are usually fixed.

[Texture Models] 

[0081] The use of texture models for rate control in an
encoder has been extensively described in the prior art,
see for example, "MPEG-4 rate control for multiple video
objects," IEEE Trans. on Circuits and Systems for Video
Technology, February 1999, by Vetro et al., and
references therein.

[0082] In a texture model as used in our object-based
transcoders 800, a variable $R$ represents the texture bits
spent for a video object (VO), a variable $Q$ denotes the
quantization parameter QP, variables $(X_1, X_2)$ the first
and second-order model parameters, and a variable $S$
the encoding complexity, such as the mean absolute
difference. The relation between $R$ and $Q$ is given by:

$$R = S \cdot \left( \frac{X_1}{Q} + \frac{X_2}{Q^2} \right)$$

[0083] Given the target amount of bits that are
assigned to a VO, and the current value of $S$, the value of
$Q$ depends on the current value of $(X_1, X_2)$. After a VO
has been encoded, the actual number of bits that are
spent is known, and the model parameters can be
updated. This can be done by linear regression using the
results of the previous n frames.
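
As a hedged illustration of how such a texture model might be used, the sketch below assumes the quadratic form $R = S \cdot (X_1/Q + X_2/Q^2)$. The function names `texture_bits`, `solve_qp`, and `fit_model` are hypothetical. The regression is an ordinary least-squares fit without an intercept, which is one plausible reading of "linear regression" here and not necessarily the exact update used in the prior art.

```python
import math


def texture_bits(s: float, q: float, x1: float, x2: float) -> float:
    """Assumed quadratic texture model: R = S * (X1/Q + X2/Q^2)."""
    return s * (x1 / q + x2 / (q * q))


def solve_qp(target_bits: float, s: float, x1: float, x2: float) -> float:
    """Solve R*Q^2 - S*X1*Q - S*X2 = 0 for the positive root Q.

    Falls back to the first-order model when the second-order term is
    zero or the discriminant is negative. The result is not rounded or
    clamped to a legal QP range; that is left to the caller.
    """
    disc = (s * x1) ** 2 + 4.0 * target_bits * s * x2
    if x2 == 0.0 or disc < 0.0:
        return s * x1 / target_bits
    return (s * x1 + math.sqrt(disc)) / (2.0 * target_bits)


def fit_model(samples):
    """Least-squares estimate of (X1, X2) from past (R, Q, S) triples.

    Fits R/S = X1*(1/Q) + X2*(1/Q^2) via the 2x2 normal equations.
    Requires at least two samples with distinct Q values.
    """
    saa = sab = sbb = say = sby = 0.0
    for r, q, s in samples:
        a, b, y = 1.0 / q, 1.0 / (q * q), r / s
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * y
        sby += b * y
    det = saa * sbb - sab * sab
    if abs(det) < 1e-18:
        raise ValueError("need samples with at least two distinct QPs")
    x1 = (say * sbb - sby * sab) / det
    x2 = (saa * sby - sab * say) / det
    return x1, x2
```

For example, the n most recent (bits, QP, complexity) triples can be passed to `fit_model`, and the refreshed parameters can then be used by `solve_qp` to pick the QP for the next target.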

[Texture Analysis] 

[0084] The transcoding problem is different in that $Q$,
the set of original QPs, and the actual number of bits
are already given. Also, rather than computing the
encoding complexity $S$ from the spatial domain, we must
define a new DCT-based complexity measure, $\tilde{S}$. This
measure is defined as:

$$\tilde{S} = \frac{1}{M_c} \sum_{m \in M} \sum_{i=1}^{63} \rho(i) \, \left| B_m(i) \right|$$

where $B_m(i)$ are the AC coefficients of a block, $m$ is a
macroblock index in the set $M$ of coded blocks, $M_c$ is the
number of blocks in that set, and $\rho(i)$ is a
frequency-dependent weighting. The complexity measure indicates
the energy of the AC coefficients, where the contribution
of high frequency components is lessened by the
weighting function. This weighting function can be
chosen to mimic that of an MPEG quantization matrix.
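
A minimal sketch of such a DCT-domain complexity measure follows. It assumes each coded block is given as 64 coefficients in scan order with the DC term at index 0, and that the measure is the weighted sum of AC magnitudes averaged over the coded blocks. The function name `dct_complexity` and the way the weights are supplied are assumptions for illustration only.

```python
def dct_complexity(blocks, weights):
    """Average weighted AC magnitude over the coded blocks.

    blocks  -- list of coded blocks, each a sequence of 64 DCT
               coefficients in scan order (index 0 is the DC term)
    weights -- sequence of 64 frequency-dependent weights rho(i);
               weights[0] is unused because the DC term is skipped
    """
    if not blocks:
        return 0.0
    total = 0.0
    for coeffs in blocks:
        # Only the 63 AC coefficients contribute to the measure.
        total += sum(weights[i] * abs(coeffs[i]) for i in range(1, 64))
    return total / len(blocks)
```

A weighting that decreases with frequency, for instance the reciprocal of a quantization-matrix entry, would lessen the contribution of high-frequency coefficients as described above; the specific choice is left open here.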
[0085] From the data transmitted in the bitstream, and
the data from past video objects, the model parameters
can be determined, and continually updated. Actually,
we can update the model twice for every transcoded
VOP: once before transcoding using data in the
bitstream, then again after coding the texture with the new
set of QPs, $Q'$. With this increased number of data
points, the model parameters are more robust and
converge faster.

[0086] The main objective of our texture analysis is
choosing $Q'$ which satisfies the rate constraint while
minimizing distortion. However, it is important to note that
optimality is conditioned on $Q$. Therefore, we must take
care in how the distortion is quantified. From this point
on, we will refer to this distortion as a conditional
distortion due to the dependence on $Q$.
[0087] One way to determine $Q'$ is to utilize the same
methodology as used in the rate control problem. This
way, we first estimate a budget for all VOP's at a
particular time instant, adjust the target to account for the
current level of the buffer, then distribute this sum of bits to
each object. Given these object-based target bit rates,
the new set of QPs can be determined from our texture
model. The main problem with this approach is that we
rely on the distribution of bits to be robust. In general,
the distribution is not robust and the ability to control our
conditional distortion is lost because the new QPs have
been computed independent of the original ones.

[Conditional Distortion] 

[0088] To overcome this problem, and to attempt to
solve for $Q'$ in some way that is dependent on $Q$, we
describe a method based on dynamic programming.
To maintain as close a quality as possible to the original
quality, the QPs of each object should change as little
as possible. Given this, we can define a conditional
distortion as:

$$D(Q' \mid Q) = \sum_{k \in K} \alpha_k \left[ D(Q'_k) - D(Q_k) \right]$$

where $k$ denotes a VOP index in the set of VOPs, $K$, and
$\alpha_k$ represents the visual significance or priority of
object $k$. Note, although $D(Q)$ is not explicitly specified, we
know that it is proportional to $Q$. The visual significance
can be a function of the object's relative size and
complexity.
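
As a small worked example, assume that the conditional distortion is the significance-weighted sum of the per-object distortion increases and that $D(Q) = cQ$ with $c = 1$; the numbers are chosen only for illustration. Take two objects with significances $\alpha = (0.7, 0.3)$ and original QPs $Q = (10, 12)$, and compare two candidate sets of new QPs:

```latex
\begin{aligned}
Q' = (12, 20):\quad D(Q' \mid Q) &= 0.7\,(12 - 10) + 0.3\,(20 - 12) = 1.4 + 2.4 = 3.8 \\
Q' = (16, 14):\quad D(Q' \mid Q) &= 0.7\,(16 - 10) + 0.3\,(14 - 12) = 4.2 + 0.6 = 4.8
\end{aligned}
```

Under these assumptions, coarsening the less significant object costs less conditional distortion than coarsening the more significant one, even though the total QP increase is larger in the first candidate (10 versus 8).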

[QP Search Space] 

[0089] It is important to note that $Q'_k \ge Q_k$ for all $k$.
Therefore, the solution space is limited to a valid solution
space shown in Figure 7. In Figure 7, the x-axis indicates
video objects 701, and the y-axis QP. The Figure also
shows a valid search space 710, a constrained search
space 711, a valid path 712, and an invalid path 713.
[0090] Given the above quantification for conditional
distortion, we solve our problem by searching for the
best path through the trellis of Figure 7, where the valid
QPs are nodes in the trellis, and each node is
associated with an estimated rate and conditional distortion.
Formally, the problem can be stated as:

$$\min D(Q' \mid Q) \quad \text{subject to} \quad R_{total} \le R_{budget}$$



[0091] Converting the constrained problem into an
unconstrained problem solves this problem, where the
rate and distortion are merged through a Lagrangian
multiplier, $\lambda$. For any $\lambda \ge 0$, the optimal solution can
always be found. To determine the value of $\lambda$ that satisfies
the constraint on the rate, the well-known bisection
algorithm can be used, see Ramchandran and Vetterli,
"Best wavelet packet bases in the rate-distortion sense,"
IEEE Trans. Image Processing, April 1993.
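
The Lagrangian search can be sketched as follows. This is a simplified illustration under several stated assumptions: the per-object rate follows a hypothetical first-order model `rate(s, q) = s / q`, the per-object distortion increase is taken as `alpha * (q_new - q_orig)`, and the legal QP range ends at 31. Because this assumed cost is separable per object, the minimization for a fixed multiplier is done object by object rather than by a full trellis search. All function names are hypothetical.

```python
def rate(s: float, q: int) -> float:
    """Hypothetical first-order rate model: bits fall as QP grows."""
    return s / q


def total_rate(qps, complexities) -> float:
    return sum(rate(s, q) for q, s in zip(qps, complexities))


def best_qps(orig_qps, alphas, complexities, lam: float, q_max: int = 31):
    """For a fixed multiplier, minimize D + lam * R for each object.

    Only QPs at or above the original QP are considered.
    """
    chosen = []
    for q0, a, s in zip(orig_qps, alphas, complexities):
        best = min(range(q0, q_max + 1),
                   key=lambda q: a * (q - q0) + lam * rate(s, q))
        chosen.append(best)
    return chosen


def solve_qps(orig_qps, alphas, complexities, budget: float,
              q_max: int = 31, iters: int = 50):
    """Bisect on the multiplier until the rate budget is just met."""
    if total_rate(orig_qps, complexities) <= budget:
        return list(orig_qps)  # no re-quantization needed
    lo, hi = 0.0, 1.0
    # Grow the upper bound until it yields a feasible solution.
    while total_rate(best_qps(orig_qps, alphas, complexities, hi, q_max),
                     complexities) > budget:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("budget cannot be met within the QP range")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        qps = best_qps(orig_qps, alphas, complexities, mid, q_max)
        if total_rate(qps, complexities) <= budget:
            hi = mid  # feasible: try a smaller multiplier
        else:
            lo = mid  # infeasible: need a larger multiplier
    return best_qps(orig_qps, alphas, complexities, hi, q_max)
```

The returned set always meets the budget, because the upper end of the bisection interval is kept feasible throughout. When the budget cannot be met even at the largest QP, the sketch raises an error; in the system described here that situation would instead be a candidate for skipping frames.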
[0092] It is important to emphasize that the search
space considered is much smaller than that found in MPEG-2
transcoding algorithms. There, an attempt is made to
find the best set of quantizers for every macroblock. In
contrast, here we only search for object-based
quantizers. Hence, our approach is very practical.

[Temporal Analysis] 

[0093] Generally speaking, the purpose of skipping
frames is to reduce the buffer occupancy level so that
buffer overflow, and ultimately the loss of packets, is
prevented. Another reason to skip frames is to allow a
trade-off between the spatial and temporal quality. In
this way, fewer frames are coded, but they are coded
with higher quality. Consequently, if the buffer is not in
danger of overflowing, then the decision to skip a frame
is incorporated into the QP selection process.
[0094] Building from the proposed technique for QP
selection, which searches a valid solution space for a
set of QPs, we achieve this spatial-temporal trade-off
by constraining the solution space. As shown in Figure
7, a valid path is one in which all elements of $Q'$ fall in
the constrained area. If one of these elements falls
outside the area, then the path is invalid in that it is not
maintaining some specified level of spatial quality. The
spatial quality is implied by the conditional distortion.
[0095] Different criteria can be used to determine the
maximum QP for a particular object. For example, the
maximum value can be a function of the object
complexity or simply a percentage of the incoming QP. In the
case where the maximum is based on complexity, the
transcoder essentially limits those objects with higher
complexity to smaller QPs, because their impact on
spatial quality is most severe. On the other hand, limiting
the complexity based on the incoming QP implies that
the transcoder maintains a similar QP distribution as
compared to the originally encoded bitstream. Both
approaches are valid. The best way
to limit the QP for each object can depend on trade-offs
between spatial and temporal quality.
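
The two ways of bounding the QP can be sketched as below. The percentage, the linear complexity mapping, and the upper QP limit of 31 are illustrative assumptions, as are the function names; the text above does not fix any of them.

```python
def max_qp_from_incoming(q_in: int, percent: float = 150.0,
                         q_limit: int = 31) -> int:
    """Ceiling expressed as a percentage of the incoming QP."""
    return min(q_limit, max(q_in, int(q_in * percent / 100.0)))


def max_qp_from_complexity(q_in: int, complexity: float,
                           max_complexity: float, q_limit: int = 31) -> int:
    """Ceiling that shrinks toward the incoming QP as complexity grows."""
    headroom = q_limit - q_in
    frac = 1.0 - min(complexity, max_complexity) / max_complexity
    return q_in + int(headroom * frac)


def path_is_valid(new_qps, orig_qps, ceilings) -> bool:
    """A path is valid when every new QP lies in its constrained range."""
    return all(q0 <= q <= c
               for q, q0, c in zip(new_qps, orig_qps, ceilings))
```

A path that fails `path_is_valid` would fall outside the constrained search space; if no valid path meets the rate budget, skipping a frame becomes the remaining way to save bits.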

[0096] Of course, one of the advantages in dealing
with object-based data is that the temporal quality of
one object can be different from another. In this way,
skipping the background object, e.g., stationary walls,
can save bits. However, because objects
are often disjoint, reducing the temporal resolution of
one object can cause holes in the composed video.
Imposing the constraint that all VOP's have the same
temporal resolution can reduce this problem.

[Shape Analysis]

[0097] To introduce the problems with transcoding 
shape data of video objects, we recall how texture infor- 
mation is transcoded. It is well known that the rate for 
texture can be reduced by a partial decoding of the data. 
In most cases, this partial decoding requires at least the 
variable-length decoding (VLD) operation to be per- 
formed. The inverse quantization and inverse DCT can 
be omitted. 

[0098] However, for shape data, this is not the case.
In MPEG-4, the shape data are coded on a per block
basis by the so-called context-based arithmetic
encoding algorithm, see Brady, "MPEG-4 standardization
methods for the compression of arbitrarily shaped
objects," IEEE Trans. Circuits and Systems for Video
Technology, December 1999. With this algorithm, a context
for each pixel is computed based on either a 9-bit or
10-bit causal template, depending on the chosen mode.
This context is used to access a probability look-up
table, such that the sequence of probabilities within a
block drives an arithmetic encoder.
[0099] In contrast to the texture, partial decoding of 
the shape is not possible because there is no interme- 
diate representation between the pixel domain and the 
bitstream. Therefore, in order to manipulate the resolu- 
tion of the shape data, the data must be fully decoded. 
After decoding, models such as described in U.S. Patent 
Application Sn. 09/410,552 "Estimating Rate-Distortion 
Characteristics of Binary Shape Data," filed October 1, 
1999 by Vetro et al, can be used to evaluate the rate- 
distortion characteristics of the shape. 

[Spatial Analysis] 

[0100] Another means of reducing the rate is to
reduce the resolution by subsampling. In version 2 of the
MPEG-4 standard, a tool called Dynamic Resolution
Conversion (DRC) has been adopted. With this tool it is
possible to reduce the
resolution, i.e., spatial quality, of one object, while
maintaining the resolution of other more important or spatially
active objects.

[Architecture] 

[0101] Figure 8 shows the components of an object- 
based transcoder 800 according to our invention. As 
with transcoding architectures in the prior art, the syntax 
of encoding standards somewhat dictates the architec- 
ture of the transcoder 800. We will now describe the ma- 
jor features of our transcoder in light of the MPEG-4 
standard and contrast these features with traditional 
frame-based transcoding. 

[0102] The transcoder 800 includes a VOL/VOP
parser 810, a shape scaler 820, a MB header parser 830, a
motion parser 840, and a texture scaler 850. The
transcoder also includes a bus 860 that transfers various
parts of the elementary bitstream 801 to a bitstream
memory 870. From this global storage, the elementary
bitstreams composition unit 880 can form a reduced rate
compressed bitstream, compliant with the MPEG-4
standard. The output elementary bitstream 809 is fed to
the multiplexer of Figure 6.

[0103] In MPEG-4, the elementary bitstreams for
each object are independent of other bitstreams. As a
result, each object is associated with a video object layer
(VOL) and video object plane (VOP) header. The VOP
header contains the quantization parameter (QP) that
was used to encode the object. The QP for each object
is later used in the modeling and analysis of the texture
information. All other bits are stored in the bitstream
memory 870 until it is time to compose the outgoing
bitstream 606 of Figure 6.

[0104] The most significant difference from other
standards is that MPEG-4 is capable of coding the
shape of an object. From the VOP layer, we find out
whether the VOP contains shape information (binary) or
not (rectangular) 812. If it is a rectangular VOP, then the
object is simply a rectangular frame and there is no need
to parse shape bits. In the case of binary shape, we need
to determine 813 if the macroblock is transparent or not.
Transparent blocks are within the bounding box of the
object, but are outside the object boundary, so there is
no motion or texture information associated with them.
[0105] The shape scaler 820 comprises three
sub-components: a shape decoder/parser 821, a shape
down-sampler 822, and a shape encoder 823. If the
shape information of the bitstream is not being scaled,
then the shape decoder/parser is simply a shape parser.
This is indicated by the control information 604 received
from the R-D shape analysis 611 of the transcoder
control unit 610. Also, in this case, the shape down-sampler
822 and shape encoder 823 are disabled. When shape
information is being scaled, the shape decoder/parser
821 must first decode the shape information to its pixel
domain representation. To reduce the rate for shape, a
block can be down-sampled by a factor of two or four
using the shape down-sampler 822, then re-encoded
using shape encoder 823. The ratio of conversion is
determined by the R-D shape analysis 611. Whether the
shape bits have simply been parsed or scaled, the
output of the shape scaler 820 is transferred to the
bitstream memory 870 via the bitstream bus 860.
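
A minimal sketch of down-sampling a decoded binary shape block by a factor of two or four is given below. The majority rule, with ties counted as opaque, is an assumption made for illustration and is not claimed to match the conversion rule of the MPEG-4 standard; the function name `downsample_shape` is hypothetical.

```python
def downsample_shape(block, factor: int):
    """Down-sample a square binary mask (list of rows of 0/1 values).

    Each factor-by-factor tile becomes one output pixel, set to 1 when
    at least half of the tile is opaque (assumed majority rule).
    """
    if factor not in (2, 4):
        raise ValueError("factor must be 2 or 4")
    n = len(block)
    if n % factor != 0 or any(len(row) != n for row in block):
        raise ValueError("block must be square with a size divisible by factor")
    out = []
    for r in range(0, n, factor):
        row = []
        for c in range(0, n, factor):
            ones = sum(block[r + dr][c + dc]
                       for dr in range(factor) for dc in range(factor))
            row.append(1 if 2 * ones >= factor * factor else 0)
        out.append(row)
    return out
```

The down-sampled mask would then be passed to the shape encoder; the rate saved and the distortion introduced both depend on the chosen factor, which is why the ratio is decided by the R-D shape analysis.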
[0106] Other than the shape bits, the remainder of the
MPEG-4 syntax is somewhat similar to that of MPEG-2
with a few exceptions. At the macroblock (MB) layer,
there exist bits that contain the coded block pattern
(CBP). The CBP is used to signal to the decoder which
blocks of a macroblock contain at least one AC
coefficient. Not only does the CBP affect the structure of the
bitstream, but the CBP also has an impact on Intra AC/DC
prediction. The reason that the transcoder must be
concerned with this parameter is that the CBP will
change according to the re-quantization of DCT blocks.
For this reason, we re-compute the CBP after the blocks
have been re-quantized; a CBP re-compute unit 856 of
the texture scaler accomplishes this. The unit 856 sends
a variable length code (VLC) 855 to the bitstream
memory 870 via the bitstream bus 860 to replace the header
that was present in the input bitstream.
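
The re-computation of the coded block pattern can be illustrated as follows. The sketch assumes a macroblock of six blocks (four luminance and two chrominance, as in 4:2:0 video), each given as 64 re-quantized coefficients with the DC term at index 0, and it packs the result into a 6-bit integer with the first block in the most significant bit. That bit layout and the name `recompute_cbp` are assumptions; the variable-length coding of the pattern is not modeled.

```python
def recompute_cbp(blocks) -> int:
    """Return a 6-bit pattern with one flag per block of a macroblock.

    A block's flag is set when, after re-quantization, it still holds
    at least one non-zero AC coefficient.
    """
    if len(blocks) != 6:
        raise ValueError("expected six blocks per macroblock")
    cbp = 0
    for coeffs in blocks:
        coded = any(c != 0 for c in coeffs[1:])  # skip the DC term
        cbp = (cbp << 1) | int(coded)
    return cbp
```

A block whose AC coefficients were all quantized to zero by the coarser QP therefore drops out of the pattern, which is why the pattern from the input bitstream cannot simply be copied.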
[0107] After we have parsed the elementary bitstream
to extract the relevant decoding parameters, we
proceed to partially decode the texture blocks 851. The
results of this process are the DCT block coefficients. If the
spatial (re-size) analysis is enabled, the object can be
down-sampled by a factor of two or four. The ability to
down-sample blocks is indicated by the transcoding
control unit 610, and the down-sampling factor by the
spatial analysis 614. Furthermore, this down-sampling
is performed in the DCT domain so that the IDCT/DCT
operations can be avoided, see U.S. Patent 5,855,151,
"Method and apparatus for down-converting a digital
signal," issued on November 10, 1998 to Bao et al. The
DCT blocks are then stored temporarily in a coefficient
memory 853. From this memory, blocks are sent to
quantizer 854, which quantizes the blocks according to
the QP sent from the R-D texture analysis 612, which
uses the techniques described in this invention to meet
the new target rate.
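
Re-quantization with a coarser QP can be sketched as below. The sketch assumes a simplified uniform quantizer in which a level reconstructs to `level * 2 * QP` and is re-quantized by truncation toward zero; the quantizers of a real codec differ in their reconstruction offsets and dead zones, so this only illustrates the principle. The function name `requantize` is hypothetical.

```python
def requantize(levels, qp_old: int, qp_new: int):
    """Map quantized levels from step size qp_old to step size qp_new.

    Assumes a simplified uniform quantizer (reconstruction value is
    level * 2 * QP) and truncation toward zero on re-quantization.
    """
    if qp_new < qp_old:
        raise ValueError("the new QP must not be smaller than the original")
    out = []
    for level in levels:
        value = level * 2 * qp_old              # approximate reconstruction
        magnitude = abs(value) // (2 * qp_new)  # coarser quantization
        out.append(magnitude if value >= 0 else -magnitude)
    return out
```

Small coefficients fall to zero under the coarser step size, which is where the bit savings in the texture data come from.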

[0108] To skip objects, the temporal analysis 613
indicates to the bitstream composition unit 880 which bits
are to be composed and sent out, and which bits should
be dropped. In this way, parts of the bitstream that
have been written into this memory will simply be
overwritten by data of a next video object.

[Implementation & Processing] 



[0109] Regarding a specific embodiment, it should be
noted that the architecture of transcoder 800 illustrates
the components for a single object. In the extreme case,
multiple objects can be scaled with multiple transcoders as
shown in Figure 6. In a software implementation that
considers multi-thread execution, this can be the most
efficient way. The challenge in a software
implementation is to allocate appropriate amounts of CPU
processing to each object under consideration.
[0110] However, for hardware implementations, the
case is very different. Hardware designers usually
prefer to have one piece of logic that handles a specific
functionality. For example, rather than implementing M
motion parsers for a maximum number of M objects that
can be received, the hardware design includes a single
motion parser that operates at a certain speed so that
multiple objects can be parsed at a given time instant.
Of course, if the number of objects exceeds the parser's
throughput, then parallel parsers can still be used. The
main point is that the number of parsers required can
be less than the total number of objects that are received,
and computation is distributed among the parallel
parsers. This notion applies to all sub-blocks of the
transcoder 800.

[Hierarchical Cue Levels] 

[0111] We now describe a system where the
transcoding is according to features extracted from various
levels of a video. In general, a video can be partitioned
into a coarse-to-fine hierarchy 900 as shown in Figure
9. A video program or session 910 is considered to be
the highest level of the hierarchy 900. This level can
represent a 30-minute news program or an entire day of
programming from a broadcast network. The program
910 includes a sequence of shots Shot-1, ..., Shot-n
911-919.

[0112] The next level 920 is partitioned into shots. A
"shot" can be a group of frames (GOF's), or a group of
video object planes (GOV's) 921-929. This level
represents smaller segments of video that begin when a
camera is turned on and last until the camera is turned off. To
avoid any confusion, we will simply refer to this level as
the shot-level 920.

[0113] Shots are composed of the most basic units:
for GOF's, frames 930, and for GOV's, video object
planes (VOP's) 931. We can also consider another level
below this, which refers to sub-regions 941-942 of the
frame or VOP.

[0114] At each level in the video program hierarchy
900, we apply feature extraction processes 901-904 to
the video data. Of course, because the data at each level
are arranged in a different manner and the relevant
features change from level to level, different feature
extraction techniques are applied to each level. That is,
program-level features are extracted in a different manner
than frame-level features.
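
The hierarchy and its per-level feature extraction can be sketched as a small tree structure. The sketch below is only an illustration: the `Node` type, the level names and the four toy extractors are hypothetical stand-ins for the extraction processes 901-904, whose actual techniques are not specified here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Node:
    """One element of the hierarchy: program, shot, frame (or VOP), or region."""
    level: str                       # "program", "shot", "frame" or "region"
    data: List[float]                # placeholder payload, e.g. activity samples
    children: List["Node"] = field(default_factory=list)

# Hypothetical extractors: a different technique is used at each level.
EXTRACTORS: Dict[str, Callable[[Node], dict]] = {
    "program": lambda n: {"num_shots": len(n.children)},
    "shot": lambda n: {"mean_activity": sum(n.data) / len(n.data) if n.data else 0.0},
    "frame": lambda n: {"num_regions": len(n.children)},
    "region": lambda n: {"peak": max(n.data, default=0.0)},
}

def extract_cues(node: Node) -> List[Tuple[str, dict]]:
    """Walk the hierarchy top-down and collect (level, features) pairs."""
    cues = [(node.level, EXTRACTORS[node.level](node))]
    for child in node.children:
        cues.extend(extract_cues(child))
    return cues

# A program with one shot, one frame and one sub-region.
region = Node("region", [0.2, 0.9])
frame = Node("frame", [], [region])
shot = Node("shot", [1.0, 3.0], [frame])
program = Node("program", [], [shot])
cues = extract_cues(program)
```

Each entry of `cues` pairs a level with the features extracted at that level, which is the form in which hints could be handed to later stages.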
[0115] In the context of our transcoder, these features
represent "hints" or "cues" 905-908 that can be applied
to the transcoding system. These hints can be either
semantic or syntactic, and can represent either high-level
or low-level meta-data.
[0116] It should be understood that meta-data can be
applied to transcoding at any given level. In general,
meta-data for the higher level data, such as shot-level, are
used for classification, bit allocation and rate-quality
considerations for that particular shot and among other
shots. For this case, the meta-data are of limited use to
the transcoder, but very useful to the CND manager 330
of Figure 3 that decides the transcoding strategy among
all outgoing content. In contrast, meta-data for lower-
level data, such as object-level, can be more useful to
the transcoder 340 itself to help with dynamic bit-allocation
because it is difficult to classify and manage outgoing
content at such a low level.
[0117] In the following, we describe how low-level features
can be clustered (classified) and mapped into
meaningful parameters that are related to the rate-quality
trade-off. In describing these clustering methods, we
mainly focus on higher-level classifications of the content,
but low-level classifications can also be included.
Next, a hybrid discrete-summary and continuous-conversion
transcoder is described. Again, the techniques
are described with a major focus on using high-level
(shot-level) meta-data in the CND manager. However,
we can also consider such meta-data in the discrete-
summary transcoder. Finally, we describe how to guide
the transcoding using meta-data. As described, this is
equally applicable to both the managing and transcoding
stages.

[Content Classifier: Stage III]

[0118] As stated earlier for Figure 3, the main function
of the content classifier 310 is to map features of content
characteristics, such as activity, video change information
and texture, into a set of parameters that we use to
make rate-quality trade-offs. To assist with this mapping
function, the content classifier also accepts meta-data
information 303. Examples of meta-data include descriptors
and description schemes (DS) that are specified
by the emerging MPEG-7 standard.
[0119] In stage III 313 of the content classifier 310, 
such low-level meta-data are mapped to rate-quality 
characteristics that are dependent on the content only. 
This is illustrated in Fig. 10. The rate-quality character- 
istics in turn affect the rate-quality functions shown in 
Fig. 5. 

[0120] The content classifier 310 receives low-level
meta-data 303. Stage I 311 extracts high-level meta-data
or classes 1001. Stage II 312 uses the predictions
321 to determine rate-quality (R-Q) characteristics that
are content, network, and device dependent. Stage III
313 extracts R-Q characteristics 1003 that are only
dependent on low-level meta-data.
[0121] As an example, we describe how the spatial 
distribution parameters of the motion activity descriptor 
in MPEG-7 enable classification of the video segments 
of a program into categories of similar motion activity 
and spatial distribution. 

[0122] Consider a news program. The news program
includes several shots of an anchorperson and a variety
of other shots that further relate to the overall news story.
[0123] The examples shown in Figures 11a-b and
12a-b consider a news program 1200 with three shots
1201-1203: an anchorperson shot, a reporter-on-scene
shot, and a police chase shot. For simplicity of the
example, we classify all news program shots into only
three categories, with the understanding that, in a real
application, the categories can differ in number and kind.

[0124] A first class 1101 represents shots where the
temporal quality of the content is more important than the
spatial quality. A second class 1102 represents shots
where the spatial quality of the content is more important
than the temporal quality, and a third class 1103 represents
shots where the spatial and temporal qualities are equally
important.
[0125] This set of classes will be referred to as SET-1
1110. Such classes are clearly characteristics of rate
and quality. The objective of stage III 313 of the content
classifier is to process low-level features and map these
features into the most suitable of these classes. It should
be noted that the importance of the spatial and temporal
quality can also be rated on a scale of one to ten, or on
a real number interval 0.0 to 1.0.
[0126] To illustrate these rate-quality classes further,
consider another set of three distinct classes as shown
in Figure 11b. A first class 1121 indicates that the shot
is very simple to compress, i.e., large compression ratios
can easily be achieved for a given distortion. A third
class 1123 represents the complete opposite, i.e., the
content of the shot is very difficult to compress, either
due to large/complex motion or a spatially active scene.
A second class 1122 is somewhere in between the first
and third classes. This set of classes will be referred to
as SET-2 1120. As with the other set of classes 1110,
these classes 1120 also illustrate the effects that content
classification can have on the rate-quality decisions
made by the CND manager 330 and how the switchable
transcoder 340 can operate. As above, the compression
difficulty can be classified on a numeric scale. It should
be understood that other sets of classes can be defined
for other types of video programs.
[0127] So far, we have described two examples of
rate-quality classes, SET-1 and SET-2. Content is classified
into these classes according to the features that
are extracted from the low-level meta-data 303. In the
following, we describe how these classes can be derived
from motion activity.

[0128] For most news programs, it is expected that the
analysis of all anchorperson shots will yield similar motion
activity parameters, which imply relatively low motion.
Given this data, and assuming SET-1 1110, we can
classify such content into the second class 1102 (importance
of spatial quality > temporal quality). Furthermore,
we can expect that all police chases, and shots of the
like, are classified into the first class 1101 (importance of
temporal quality > spatial quality). Finally, depending on
the background activity of the reporter on the scene, this
type of shot can be classified in any one of the three
available classes. For the purpose of the example, this
shot is classified into the third class.
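
The mapping from motion activity to the SET-1 classes can be sketched as a simple thresholding rule. This is a minimal sketch under stated assumptions: the motion intensity is assumed to be normalized to the interval 0.0 to 1.0, and the two thresholds are hypothetical values chosen only for illustration. The class numbers follow the reference numerals 1101-1103.

```python
from enum import Enum

class Set1(Enum):
    TEMPORAL = 1101   # temporal quality more important than spatial quality
    SPATIAL = 1102    # spatial quality more important than temporal quality
    EQUAL = 1103      # spatial and temporal quality equally important

def classify_set1(motion_intensity: float,
                  low: float = 0.3, high: float = 0.7) -> Set1:
    """Map a normalized motion intensity to a SET-1 class.

    The thresholds `low` and `high` are illustrative assumptions.
    """
    if not 0.0 <= motion_intensity <= 1.0:
        raise ValueError("motion_intensity must lie in [0.0, 1.0]")
    if motion_intensity < low:
        return Set1.SPATIAL      # e.g. anchorperson shot with little motion
    if motion_intensity > high:
        return Set1.TEMPORAL     # e.g. police chase with high motion
    return Set1.EQUAL            # e.g. reporter on the scene

anchor_class = classify_set1(0.1)
chase_class = classify_set1(0.9)
reporter_class = classify_set1(0.5)
```

In a real application the thresholds would be derived by clustering the descriptors of many shots rather than being fixed by hand.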
[0129] Fig. 12(a) illustrates a transcoding strategy according
to the classification of SET-1. The anchorperson
shot 1201 is transcoded using a discrete summary
transcoder 1210, see block 441 of Figure 4. This transcoder
reduces the entire shot 1201 to a single frame
1211, i.e., a still picture of the anchorperson. For the
duration of the shot, the entire audio portion of the
anchorperson talking is provided.

[0130] The reporter on the scene shot 1202 is continuously
converted at five frames per second 1221 with
full audio to preserve some sense of motion in the
background to the viewer.

[0131] The police chase shot 1203 is also continuously
converted 1230 at thirty frames per second 1231.
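
The SET-1 strategy of this example can be written down as a small lookup table. The sketch below only restates the three decisions described for shots 1201-1203; the `Strategy` type and the dictionary keys are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Strategy:
    transcoder: str               # "discrete-summary" or "continuous-conversion"
    frame_rate: Optional[float]   # frames per second; None for a still summary
    summary_frames: int = 0       # number of still frames kept for the shot

# Decisions taken in the news-program example under SET-1.
SET1_STRATEGY = {
    "anchorperson": Strategy("discrete-summary", None, summary_frames=1),
    "reporter_on_scene": Strategy("continuous-conversion", 5.0),
    "police_chase": Strategy("continuous-conversion", 30.0),
}

def plan(shots):
    """Return the strategy for each shot kind, in program order."""
    return [SET1_STRATEGY[kind] for kind in shots]

program_plan = plan(["anchorperson", "reporter_on_scene", "police_chase"])
```

The table makes the hybrid nature of the strategy visible: one shot is summarized while the other two are continuously converted at different frame rates.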
[0132] In any case, whether the content classifier is
given access to meta-data that describe the content,
or the classifier derives the data directly from the content
itself, the utility of this information can directly be
understood in view of the rate-quality trade-offs that the CND
manager must ultimately make.
[0133] In contrast with the above example, if we assume
the same program 1200 and SET-2 1120 classifications
instead, then the classification results can be
interpreted differently as shown in Figure 13. With SET-2,
the lack of motion in the anchorperson shot 1201
makes the segment very easy to compress, hence it is
classified into the first class 1121 of SET-2. This shot is
continuously converted 1240 with high compression at
thirty frames per second 1241. The police chase shot
1203, however, contains high motion and is more difficult
to compress. Therefore, it is classified into the third
class 1123 of SET-2. It is continuously converted 1260
at 7.5 frames per second 1261. Again, depending on the
characteristics of the shot 1202 with the on-scene reporter,
it can fall into any one of the three classes. For
purpose of the example, it is assigned to the second
class 1122, and continuously converted 1250 at 15
frames per second 1251.
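
The SET-2 decisions can be sketched in the same spirit, this time starting from a numeric compression-difficulty score. The frame rates are those of the example; the normalized difficulty scale and the two class boundaries are assumptions made only for this sketch.

```python
# Frame rates used for the SET-2 classes in the news-program example.
SET2_FRAME_RATE = {
    1121: 30.0,   # easy to compress
    1122: 15.0,   # moderately difficult
    1123: 7.5,    # difficult to compress
}

def set2_class(difficulty: float,
               easy_below: float = 1 / 3, hard_above: float = 2 / 3) -> int:
    """Map a difficulty score in [0.0, 1.0] to a SET-2 class numeral.

    The boundaries are illustrative assumptions.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0.0, 1.0]")
    if difficulty < easy_below:
        return 1121
    if difficulty > hard_above:
        return 1123
    return 1122

def target_frame_rate(difficulty: float) -> float:
    """Frame rate for continuous conversion, chosen from the shot's class."""
    return SET2_FRAME_RATE[set2_class(difficulty)]

anchor_fps = target_frame_rate(0.1)
reporter_fps = target_frame_rate(0.5)
chase_fps = target_frame_rate(0.9)
```

Note that the same police chase shot receives the highest frame rate under SET-1 and the lowest under SET-2, which is exactly the difference in interpretation described above.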

[0134] It should be noted that the hints can produce
either a constant or variable rate bit stream (CBR or
VBR). For example, if the classification is based on
compression difficulty (SET-2), then a CBR bit stream can
be produced when a low frame rate is imposed on a
difficult-to-compress sequence of frames, and a VBR bit
stream when more bits are allocated.
[0135] In the following paragraphs, we describe how 
these different classifications can be used to generate 
a transcoding strategy. 

[Hybrid Continuous-Conversion and Discrete-Summary Transcoding]

[0136] It should be emphasized that the rate-quality 
mapping implied by each class can vary widely depend- 
ing on the specific application. In the above examples, 
we illustrated that the spatial and temporal quality can 
be affected by the difficulty to compress a video or the 
level of priority assigned to the spatial and temporal 
quality. Both classifications were derived from low-level 
features. 

[0137] To the CND manager 330, these classifications
suggest ways in which the content can be manipulated.
In fact, classification can significantly reduce the
number of scenarios to consider. For instance, if the
CND manager has to consider the rate-quality trade-offs
for multiple bit streams (frames or objects) at a given
time instant, then the CND manager can consider the
best way to distribute transcoding responsibility between
continuous-conversion and discrete-summary
transcoding. Rather than choosing one way for all segments
under consideration, it is also possible to consider
a hybrid scheme. Priorities of the program, or compression
difficulties according to its low-level features,
are examples of useful parameters that can be used to
make such decisions.

[0138] Figs. 12(a) and (b) illustrate how the classifications
in SET-1 1110 and SET-2 1120 affect the strategy
determined by the CND manager and the way in
which the transcoder manipulates the original data. Of
particular interest in Figure 12 is that a hybrid transcoding
scheme is employed.

[0139] Going back to our example of the news program
1200, and considering SET-1 classifications, we
can assign the anchorperson shot a lower priority than
the police chases. If we are dealing with object-based
video, then another way to transcode is to assign the
background of shot 1201 a lower priority than the
anchorperson in the foreground. This can all be accomplished
through classification of object-level motion
activity parameters, for example.
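
One simple way to turn such priorities into a bit budget is a proportional split. The sketch below is an assumption made for illustration: the text only states that some shots or objects receive a lower priority, not that bits are divided in proportion to a numeric weight. The function and key names are hypothetical.

```python
def allocate_bits(total_bits: int, priorities: dict) -> dict:
    """Split a bit budget among shots or objects in proportion to priority.

    Proportional splitting is an illustrative rule, not the only possible one.
    """
    weight_sum = sum(priorities.values())
    if weight_sum <= 0:
        raise ValueError("at least one priority must be positive")
    alloc = {name: int(total_bits * w / weight_sum)
             for name, w in priorities.items()}
    # Integer truncation can leave a few bits over; give them to the
    # highest-priority item so the whole budget is used.
    top = max(priorities, key=priorities.get)
    alloc[top] += total_bits - sum(alloc.values())
    return alloc

# Shot-level example: the police chase outranks the anchorperson shot.
shot_budget = allocate_bits(1000, {"anchorperson": 1, "police_chase": 3})

# Object-level example within shot 1201: foreground outranks background.
object_budget = allocate_bits(100, {"background": 1, "anchorperson_fg": 2})
```

The same function serves both the shot-level and the object-level case, since only the labels and weights change.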
[0140] We have used the motion activity to illustrate
these concepts. However, it should be understood that
other low-level features or MPEG-7 descriptors such as
shape parameters, texture information, etc., can also be
used. Whether low-level features are considered individually
or in combination, they can be used to effectively
cluster and classify video content into meaningful
parameters that assist the CND manager and the
transcoders.

[0141] It may appear that the content classifier 310 and
CND manager 330 conflict with the TCU 610 of Figure
6, but this is not the case. The classifier and CND manager
attempt to pre-select the best strategy for the transcoder
340. Given this strategy and instruction from the
manager, the transcoder is responsible for manipulating
the content in the best way possible. In the event the
transcoder cannot fulfill the request due to erroneous
predictions or the strategy chosen by the CND manager,
the transcoder still needs mechanisms to cope with such
situations, such as temporal analysis. Therefore, meta-data
can also be used in the TCU. However, the purpose
of the meta-data for the TCU is different than for the
classifier and CND manager.

[Effects of Meta-Data on Transcoding] 

[0142] There are two ways that meta-data can affect
transcoding. Both are directly related to the bit allocation
problem described above. The first way is in the CND
manager 330 where the bit allocation is used to derive
a strategy and ultimately a decision on how to use the
functions provided by the Discrete-Summary and
Continuous-Conversion Transcoders 441-442. In this way,
the rate-quality functions of Figure 5 are used for decision
making. The second way is in the transcoder 340
itself. Again, the meta-data are used for estimation, but
rather than making decisions on strategy, the meta-data
are used to make real-time decisions on the coding
parameters that can be used to meet the bit-rate objectives.
In this way, the coding parameters are chosen so
that the transcoders achieve the optimal rate-quality
functions of Figure 5.
[0143] In general, low-level and high-level meta-data
provide hints to perform discrete-summary and continuous-
conversion transcoding. These hints are useful to
both the CND manager and the transcoder. To illustrate,
we first consider high-level semantic information associated
with the content. The semantic information can
be associated with the content automatically or by manual
annotation.

[0144] Take the case where a database stores a
number of video programs. The videos have been rated
according to a variety of categories, e.g., level of "action."
In an application where multiple users request various
shots simultaneously, the CND manager 330 must
decide how much rate is allocated to each shot. In the
discrete-summary transcoder 441, this rate can correspond
to the number of frames that are sent, whereas
in the continuous-conversion transcoder 442, the rate
can correspond to the target frame-rate that is acceptable.
Given that the level of action indicates a certain
level of temporal activity, bits can be allocated per frame
sequence according to the description of the content.
For shots with high action, the CND manager determines
that a frame-rate less than a predetermined level
is unacceptable for the continuous-conversion transcoder,
and that a better quality shot can be delivered by
summarizing the content with the discrete-summary
transcoder.
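
The decision described for high-action shots can be sketched as a two-condition rule. The thresholds below are hypothetical: the text speaks only of "a predetermined level" for the frame rate, and the action level is assumed here to be a normalized rating.

```python
def choose_transcoder(achievable_fps: float, action_level: float,
                      min_fps_for_high_action: float = 15.0,
                      high_action_threshold: float = 0.7) -> str:
    """Pick a transcoder for one shot.

    For high-action shots, a frame rate below the predetermined minimum is
    treated as unacceptable and the shot is summarized instead. Both
    threshold values are illustrative assumptions.
    """
    high_action = action_level >= high_action_threshold
    if high_action and achievable_fps < min_fps_for_high_action:
        return "discrete-summary"
    return "continuous-conversion"

# High action, but the channel only supports 5 frames per second.
starved = choose_transcoder(achievable_fps=5.0, action_level=0.9)
# High action with enough rate for 30 frames per second.
served = choose_transcoder(achievable_fps=30.0, action_level=0.9)
# Low action tolerates the low frame rate.
calm = choose_transcoder(achievable_fps=5.0, action_level=0.2)
```

In a multi-user setting the manager would apply such a rule per shot after dividing the available rate among the requests.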

[0145] Within the discrete-summary transcoder, we
can also consider the number of frames that are acceptable
to achieve a reasonable level of perceptual quality.
Going back to the low-level motion activity descriptor, it
can be reasoned that video sequences having associated
activity parameters that imply low motion intensity
can be summarized with fewer frames than those shots
with activity parameters that imply high motion intensity.
As an extension to this, it can easily be understood how
such bit allocations can be applied at the object-level as
well.
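
A possible rule for the number of summary frames is sketched below. It assumes that the motion intensity is available as an integer from 1 (low) to 5 (high) and that one frame per ten seconds suffices at the lowest intensity; both the scale and the scaling constant are assumptions of this sketch, not values given in the text.

```python
import math

def summary_frame_count(intensity: int, shot_seconds: float,
                        seconds_per_frame_at_unit_intensity: float = 10.0) -> int:
    """Number of still frames used to summarize one shot.

    More frames are kept for longer shots and for higher motion intensity;
    at least one frame is always kept.
    """
    if not 1 <= intensity <= 5:
        raise ValueError("intensity must be an integer from 1 to 5")
    if shot_seconds <= 0:
        raise ValueError("shot_seconds must be positive")
    frames = intensity * shot_seconds / seconds_per_frame_at_unit_intensity
    return max(1, math.ceil(frames))

low_motion = summary_frame_count(1, 10.0)    # anchorperson-like shot
high_motion = summary_frame_count(5, 10.0)   # chase-like shot
short_shot = summary_frame_count(1, 3.0)
```

The same counting rule could be applied per object rather than per shot when object-level activity parameters are available.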

[Generating High-Level Meta-Data from Low-Level Meta-Data]

[0146] The process of generating high-level meta-data
from low-level meta-data can be defined as meta-data
encoding. Such an encoding process can be considered
at Stage I 311 in the content classifier of our transcoding
system.

[0147] Additionally, this high-level generation process
can be used in a stand-alone system. An example of
one such stand-alone system is a system that instantiates
description schemes specified by the MPEG-7
standard. One can call such a system an MPEG-7 high-
level meta-data encoder.

[0148] In the current MPEG-7 Working Draft, there
are high-level description schemes that are placeholders
for various types of meta-data. It should be noted
that normative parts of the standard explicitly define
requirements essential to an implementation; informative
parts only suggest potential techniques or one way of
doing something. In MPEG-2, determining suitable motion
vectors or quantization parameters is considered
an encoder issue, hence an informative part of the standard.
The standard does specify variable-length coding
(VLC) tables for the motion vector, and a 5-bit field for
the quantization parameter. How these fields are used
is strictly an encoder issue, and of no concern to the
standard, hence informative.

[0149] In MPEG-7, the normative and informative
fields of the various description schemes are in a similar
situation. The fields have been specified, but how one
generates data for these fields is informative. For transcoding
and summarization, we consider various description
schemes that have been specified in the
MPEG-7 Working Draft, for example, the SummaryDS,
the VariationDS, HierarchicalSummaryDS, HighlightSegmentDS,
ClusterDS, and ClassifierDS; see ISO/IEC JTC N3113,
"MPEG-7 Multimedia Descriptor Schemes WD," December 1999,
for additional descriptor schemes.

[0150] For example, the SummaryDS is used to specify
a visual abstract of the content that is primarily used
for content browsing and navigation, and the VariationDS
is used to specify variations of the content. In
general, the variations can be generated in a number of
ways and reflect revisions and manipulations of the original
data. However, such description schemes as the
SummaryDS and the VariationDS do not describe how
to summarize or generate variations of the content.
[0151] These description schemes simply include
tags or fields of information that provide a system with
information on the "properties" of the summarized content
or variation data, "where" the content can be found,
and "what" operations have been performed on it,
etc. This implies that all manipulations have been done
prior to transmission. Where such fields do exist, the
task of the CND manager is simplified because the manager
is handed a list of available summaries or pre-
transcoded data with associated properties.
[0152] Although there are advantages in having this
information available, such as a simplified CND manager
and transcoder, there are two major problems. The
first major problem is that these variations must be generated
prior to any request for the original video. As a
result, real-time transmission is not an option because
the delay associated with generating multiple variations
of the content is too long. The second major problem is
that network characteristics are likely to change over
time. Therefore, a specific pre-transcoded variation chosen
at one time instant under the current network conditions
may not remain suitable for the entire duration.
[0153] Despite these disadvantages, the standard will
not specify how to fill the fields in these description
schemes. These are encoder issues for the MPEG-7
standard.

[0154] Assuming a non-real-time transmission application,
we describe a system to generate the contents
of high-level fields in the description scheme syntax using
low-level descriptors.

[Variations of Content] 

[0155] Essentially, the same methods that are used
for real-time transcoding can also be used to generate
summaries and variations of a particular video. Off-line,
various network conditions can be simulated, and program
content can be transcoded according to the various
simulated conditions. The resulting content can be
stored in a database. In executing this pre-transcoding,
not only should the network conditions, such as available
bandwidth, be noted, but the system should also note
the way in which the data are manipulated. This type of
information will populate the fields of the description
scheme.

[High-Level Meta-Data Encoder for Video Programs] 

[0156] An illustration of such an encoder that generates
summaries and variation data along with associated
instantiations of corresponding description schemes
is shown in Fig. 13. The components of the encoder
resemble those of the adaptable transcoding system 300
of Figure 3. However, the encoder is different in that it
is not connected to a network to receive and transmit in
real-time while transcoding. Instead, the encoder is connected
to a database where videos are stored. The encoder
generates, off-line, various versions of the video
for later real-time delivery.

[0157] As shown in Figure 14, an adaptable bitstream
video delivery system 1300 includes five major components:
a content classifier 1310, a network-device
(ND) generator 1320, a CND manager 1330, a switchable
transcoder 1340 and a DS instantiator 1350. The
system 1300 has its input and output connected to a
database 1360. The system 1300 also includes a selector
1370 connected to the network and the database 1360.
[0158] An object of the delivery system 1300 is to generate
variation and/or summary bitstreams 1308 from
an original compressed bitstream (Video In) 1301. The
content of the bitstream can be visual, audio, textual,
natural, synthetic, primitive, data, compound, or combinations
thereof.

[0159] As noted earlier, the video delivery system
1300 resembles the adaptable transcoder system 300.
The major differences are that it is not connected to a
user device 360 via the network 350 of Figure 3, and the
transcoding is not performed in real-time. The ND generator
1320 replaces the device and network.
[0160] Essentially, the generator is responsible for
simulating network and device (ND) constraints such as
would exist in a real-time operation. For instance, the
ND generator can simulate a CBR channel with 64 kbps,
128 kbps and 512 kbps, or a VBR channel. Additionally,
the generator can simulate a channel that is experiencing
a decrease in available bandwidth. This loss can be
linear, quadratic, or very sharp. Many other typical conditions
can be considered as well; some conditions can
relate to user device constraints, such as limited display
capabilities.
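
The channel conditions named above can be sketched as bandwidth profiles over discrete time steps. This is a minimal sketch: the function names, the step-based time axis and the choice to model a "very sharp" loss as a step at the midpoint are assumptions made for illustration.

```python
from typing import List

def cbr_profile(rate_kbps: float, steps: int) -> List[float]:
    """Constant available bandwidth for every time step."""
    return [float(rate_kbps)] * steps

def decreasing_profile(start_kbps: float, end_kbps: float, steps: int,
                       shape: str = "linear") -> List[float]:
    """Available bandwidth falling from start_kbps to end_kbps.

    shape is "linear", "quadratic" or "sharp"; "sharp" is modeled here
    as an abrupt drop at the midpoint (an assumption of this sketch).
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    profile = []
    for i in range(steps):
        t = i / (steps - 1)          # normalized time in [0, 1]
        if shape == "linear":
            lost = t
        elif shape == "quadratic":
            lost = t * t
        elif shape == "sharp":
            lost = 0.0 if t < 0.5 else 1.0
        else:
            raise ValueError("unknown shape: " + shape)
        profile.append(start_kbps - (start_kbps - end_kbps) * lost)
    return profile

# The three CBR channels of the example, plus three loss patterns.
cbr_channels = {rate: cbr_profile(rate, 3) for rate in (64, 128, 512)}
linear = decreasing_profile(512, 64, 5, "linear")
quadratic = decreasing_profile(512, 64, 5, "quadratic")
sharp = decreasing_profile(512, 64, 5, "sharp")
```

Each profile would then drive one off-line transcoding run, producing one candidate variation of the input video.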

[0161] Each of these different conditions can result in
a different variation of the original input video 1301. In
essence, the database will store a large number of variations
of the input bitstream 1301, so that in the future,
a bit stream for some real-time operating condition will
be readily available to the downstream transcoders. The
variation bitstreams can be both CBR and VBR.
[0162] The purpose of the ND generator 1320 is to
simulate various network-device conditions and to generate
the variations/summaries 1308 of the original content
1301 in an automatic way according to these conditions.
While doing this, the system also instantiates
corresponding description schemes 1309.
[0163] Because the fields of the description scheme
(e.g., VariationDS and SummaryDS) need to be filled
with properties of the variation bitstream 1308 and the
method that has been imposed to manipulate it, the
CND manager must pass this information to the DS
Instantiator 1350. After a variation has been instantiated,
the corresponding description scheme can be accessed
and used, for example, by the real-time transcoder 300
as described above.
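
The hand-off from the CND manager to the DS instantiator can be sketched as the filling of a small record. The field names below are hypothetical and are not the syntax of the MPEG-7 VariationDS or SummaryDS; they only mirror the kinds of information mentioned above (properties of the variation, where it is stored, and the method used to produce it).

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class VariationRecord:
    """Hypothetical stand-in for one instantiated description scheme."""
    source_id: str          # identifies the original video
    location: str           # where the variation bitstream is stored
    channel_type: str       # "CBR" or "VBR"
    bit_rate_kbps: float    # simulated channel rate the variation targets
    frame_rate_fps: float   # frame rate of the variation
    method: str             # how the original data were manipulated

def instantiate_description(source_id: str, location: str, channel_type: str,
                            bit_rate_kbps: float, frame_rate_fps: float,
                            method: str) -> dict:
    """Build the record that would be stored next to the variation."""
    if channel_type not in ("CBR", "VBR"):
        raise ValueError("channel_type must be 'CBR' or 'VBR'")
    record = VariationRecord(source_id, location, channel_type,
                             bit_rate_kbps, frame_rate_fps, method)
    return asdict(record)

# Illustrative values only; the identifiers and path are made up.
description = instantiate_description(
    "video-1301", "db/variations/v1", "CBR", 64.0, 7.5,
    "continuous-conversion")
```

Storing the method alongside the properties lets a later real-time stage judge how far a variation already departs from the original.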

[Rate-Quality Functions] 


[0164] As shown in Figure 15, the variations and/or
summaries 1308 that are produced by the system 1300
are a subset of points V(1), ..., V(5) on an optimal rate-
quality function 1401. In Figure 15, a finite number of
points are shown. These points represent the optimal
operating points for particular variations. Each variation
has an associated instantiated description scheme (DS)
1309. Both the variation bitstreams 1308 and the instantiated
description schemes 1309 are stored in the database
1360, along with the original video stream 1301.
[0165] In a typical application, the selector 1370 of
system 1300 receives a request for a particular video
program. In response, the selector provides information
on the available variations and associated DS stored in
the database 1360. The CND manager of the transcoder
300 makes use of this pre-transcoded data. The high-
level meta-data allows the transcoder to associate a
particular variation of the requested video with current
real-time network and device constraints. If a suitable
match is found, then the CND manager requests that the
selector transmit that particular variation over the network
350, and the transcoder 340 can operate in a by-pass mode.
If a close match is found, then the transcoder 340 can
operate more efficiently.
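
The matching step can be sketched as a search over the stored variations. The sketch below makes several assumptions: each variation is reduced to a single bit-rate property, a "suitable match" is taken to mean a variation that fits within the currently available rate, and a "close match" is taken to be the lowest-rate variation when none fits. The key and mode names are hypothetical.

```python
from typing import List, Optional, Tuple

def select_variation(variations: List[dict],
                     available_kbps: float) -> Tuple[Optional[dict], str]:
    """Pick a stored variation for the current channel.

    Returns (variation, mode), where mode is one of:
      "bypass"              a variation fits and is sent unchanged,
      "transcode-variation" the closest variation is reduced further,
      "transcode-original"  nothing is stored, start from the original.
    """
    fitting = [v for v in variations if v["kbps"] <= available_kbps]
    if fitting:
        # Highest-rate variation that still fits the channel.
        return max(fitting, key=lambda v: v["kbps"]), "bypass"
    if not variations:
        return None, "transcode-original"
    # No variation fits: start from the one needing the least reduction.
    return min(variations, key=lambda v: v["kbps"]), "transcode-variation"

stored = [{"id": "v64", "kbps": 64.0},
          {"id": "v128", "kbps": 128.0},
          {"id": "v512", "kbps": 512.0}]

fit, fit_mode = select_variation(stored, 200.0)
near, near_mode = select_variation(stored, 32.0)
none, none_mode = select_variation([], 200.0)
```

Starting from a close variation rather than from the original reduces the amount of rate reduction the real-time transcoder has to perform.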

[0166] This is only one practical example application.
It is also possible to further manipulate and alter the already
manipulated bitstreams 1308 to increase the
match with current network and device constraints. This
becomes a matter of generating a large number of pre-
transcoded bitstreams that cover a very wide range of
conditions versus generating a few pre-transcoded bitstreams
that cover some of the most common conditions.
Different levels of quality can be expected from
each approach because transcoding by the delivery
system 1300 under relaxed time constraints will generally
lead to a better quality video.
[0167] Although the invention has been described by 
way of examples of preferred embodiments, it is to be 
understood that various other adaptations and modifi- 
cations can be made within the spirit and scope of the 
invention. Therefore, it is the object of the appended 
claims to cover all such variations and modifications as 
come within the true spirit and scope of the invention. 



Claims

1. An apparatus for transcoding a compressed video,
comprising:

a generator configured to simulate a plurality of
constraints of a network and constraints of a user
device;
a classifier, coupled to receive an input compressed
video and the plurality of constraints,
configured to generate content information
from features of the input compressed video;
a manager, coupled to the classifier and the
generator, configured to produce a plurality of
conversion modes dependent on the constraints
and content information; and
a transcoder, coupled to the classifier and the
manager, configured to produce a plurality of
output compressed videos, one for each of the
plurality of conversion modes.

2. The apparatus of claim 1 wherein the content of the
compressed video is selected from the group consisting
of visual, audio, textual, natural, synthetic,
primitive, compound data, and combinations thereof.

3. The apparatus of claim 1 further comprising:

a database for storing the input compressed
video and the plurality of output compressed
videos.

4. The apparatus of claim 1 further comprising:

an instantiator, coupled to the manager, configured
to generate a descriptor scheme for each
of the plurality of output compressed videos.

5. The apparatus of claim 1 further comprising:

a selector, coupled to the network and the database,
configured to select a particular one
of the output compressed videos in response
to a request.

6. The apparatus of claim 1 wherein the plurality of
output compressed videos include CBR bit streams
and VBR bit streams.

7. The apparatus of claim 4 further comprising:

means for partitioning the compressed video into
a plurality of hierarchical levels;
a feature extractor configured to extract features
from each of the plurality of hierarchical
levels, the features to be combined with each of
the descriptor schemes.

8. A method for transcoding a compressed video,
comprising the steps of:

simulating a plurality of constraints of a network
and constraints of a user device;
generating content information from features of
an input compressed video;
producing a plurality of conversion modes dependent
on the constraints and content information;
producing an output compressed video for
each of the plurality of conversion modes.




FIG. 1 [Block diagram: transcoder 100 with Decode and Re-Encode stages. Input: compressed bitstream at rate Rin. Output: compressed bitstream at rate Rout. Control input: new rate Rout.]

FIG. 2 [Block diagram. Labeled blocks: VLD Parser, Bit Allocation Analyzer, Delay, VLD & IQ, Re-Quantize, Rate Controller, VLC. Labeled signals: New Rate, Reduced Rate Bits-Out.]

FIG. 6 [Block diagram. Labeled blocks: DE-MUX, Object 1 Transcoder, Object 2 Transcoder, ..., Object N Transcoder (each 800), MUX, and Transcoding Control Unit 610 comprising R-D Shape Analysis, R-D Texture Analysis, Temporal Analysis and Spatial Analysis. Labeled signals: Control Information, Object Data, Rate Feedback.]


FIG. 8 [Block diagram of object transcoder 800. Labeled blocks: VOL/VOP Parser, Bitstream Memory, Bitstream Composition Unit, Shape Decoder, Shape Scaler with Shape Down-Sampler, Shape Encoder, MB Header Parser, Motion Parser, CBP Re-Compute Unit, Partial Texture Decoder, Texture Scaler with Texture Down-Sampler, Quantizer, VLC, Coefficient Memory. Labeled switch positions: Rectangular, Binary, Transparent (open), Not Transparent, Intra, Inter, Coded, Not Coded (open). Input: Elementary Input Bitstream. Output: Elementary Output Bitstream.]

INTERNATIONAL SEARCH REPORT 



International application No.
PCT/JP01/00662

A. CLASSIFICATION OF SUBJECT MATTER
Int. Cl.7 H04N7/24

B. FIELDS SEARCHED
Minimum documentation searched (classification system followed by classification symbols)
Int. Cl.7 H04N7/24-7/68, H04N5/91-5/956, H04N1/41-1/419




C. DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document, with indication, where appropriate, of the relevant passages; Category*; Relevant to claim No.

JP 2001-94994 A (Canon Inc.), 06 April 2001 (06.04.01)
Full text; Figs. 1 to 8 (Family: none)
Category*: E, A. Relevant to claim No.: 1-8

JP 2001-86460 A (NEC Corporation), 30 March 2001 (30.03.01)
Full text; Figs. 1 to 7 (Family: none)
Category*: E, A. Relevant to claim No.: 1-8

JP 11-252546 A (Hitachi, Ltd.), 17 September 1999 (17.09.99)
Full text; Figs. 1 to 14 (Family: none)
Relevant to claim No.: 1-8

JP 8-237621 A (AT & T Corporation), 13 September 1996 (13.09.96)
Full text; Figs. 1 to 10
& EP 711080 A2 & US 5629736 A & CA 2159847 C
Relevant to claim No.: 1-8

JP 10-164143 A (Hitachi, Ltd.), 19 June 1998 (19.06.98)
Full text; Figs. 1 to 5 (Family: none)
Relevant to claim No.: 1-8

☒ Further documents are listed in the continuation of Box C. □ See patent family annex.



* Special categories of cited documents:

"A" document defining the general state of the art which is not considered to be of particular relevance
"E" earlier document but published on or after the international filing date
"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or other special reason (as specified)
"O" document referring to an oral disclosure, use, exhibition or other means
"P" document published prior to the international filing date but later than the priority date claimed
"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention
"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step when the document is taken alone
"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art
"&" document member of the same patent family

Date of the actual completion of the international search
27 April 2001 (27.04.01)

Date of mailing of the international search report
15 May 2001 (15.05.01)

Name and mailing address of the ISA/
Japanese Patent Office

Authorized officer

Facsimile No.
Telephone No.

Form PCT/ISA/210 (second sheet) (July 1992)




INTERNATIONAL SEARCH REPORT 



International application No. 

PCT/JP01/00662 



C (Continuation). DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document, with indication, where appropriate, of the relevant passages; Category*; Relevant to claim No.

JP 8-237653 A (AT & T Corporation), 13 September 1996 (13.09.96)
Full text; Figs. 1 to 8
& EP 719055 A2 & US 5623312 A & CA 2164751 C
Relevant to claim No.: 1-8

JP 2000-155436 A (Tektronix Inc.), 16 June 2000 (16.06.00)
Full text; Figs. 1 to 3
& EP 1001582 A2 & CN 1254151 A
Relevant to claim No.: 1-8

JP 2001-103425 A (Victor Company of Japan, Limited), 13 April 2001 (13.04.01)
Full text; Figs. 1 to 2 (Family: none)
Category*: E, A. Relevant to claim No.: 1-8

JP 2001-94980 A (Sharp Corporation), 06 April 2001 (06.04.01)
Full text; Figs. 1 to 15 (Family: none)
Category*: E, A. Relevant to claim No.: 1-8

JP 2001-69502 A (Toshiba Corporation), 13 March 2001 (13.03.01)
Full text; Figs. 1 to 26 (Family: none)
Category*: E, A. Relevant to claim No.: 1-8



Form PCT/ISA/210 (continuation of second sheet) (July 1992)






